1 Introduction

1.1 Problem Statement

The objective of this study is to investigate the negative health effects associated with smoking, including both physical and mental health outcomes. Given the widespread prevalence of smoking and the rising use of electronic nicotine products, it is crucial to understand the extent to which these habits impact overall health. Our analysis will focus on identifying the associations between smoking and various health-related issues, as well as exploring the relationship between smoking habits and mental health indicators. Additionally, we will compare these findings with the health effects associated with the use of electronic nicotine products.

The central question we want to answer is: what are the negative health effects associated with smoking? We will look at the association between smoking and health related issues. We will also look at some mental health indicators and how they relate to smoking habits. We will also make an interesting comparison with usage of electronic nicotine products.

1.2 Abstract

The negative health effects of smoking have been well-documented, yet the comprehensive impact on both physical and mental health continues to be a critical area of study. This research aims to analyze the associations between smoking and health-related issues using data from the 2021 Medical Expenditure Panel Survey Household Component, which surveyed 15,000 households in the United States. By examining physical health outcomes and mental health indicators, this study seeks to provide a detailed understanding of how smoking affects individuals. Furthermore, we will compare these effects with those associated with the use of electronic nicotine products, offering insights into the potential risks of these increasingly popular alternatives. Our findings aim to inform public health strategies and contribute to ongoing efforts to mitigate the health risks associated with smoking and electronic nicotine product usage.

The dataset we are working with is the Medical Expenditure Panel Survey Household component of 2021. It’s a survey conducted on 15000 households in the United States of America.

Here is info about characteristics: https://meps.ahrq.gov/data_files/publications/st554/stat554.shtml

Here is questionary report: https://meps.ahrq.gov/survey_comp/hc_survey/paper_quest/2021/2021_SHE_SAQ.pdf

Codebook: https://meps.ahrq.gov/data_stats/download_data/pufs/h233/h233cb.pdf

# the dataset is in the github repo, we don't need to download it every time
h233 <- read_dta("h233.dta")
head(h233)
## # A tibble: 6 × 1,488
##      DUID   PID DUPERSID   PANEL         FAMID31 FAMID42 FAMID53 FAMID21 FAMIDYR
##     <dbl> <dbl> <chr>      <dbl+lbl>     <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 2320005   101 2320005101 23 [23 PANEL… A       A       A       A       A      
## 2 2320005   102 2320005102 23 [23 PANEL… A       A       A       A       A      
## 3 2320006   101 2320006101 23 [23 PANEL… A       A       A       A       A      
## 4 2320006   102 2320006102 23 [23 PANEL… B       B       B       B       B      
## 5 2320006   103 2320006103 23 [23 PANEL… A       A       A       A       A      
## 6 2320012   102 2320012102 23 [23 PANEL… A       A       A       A       A      
## # ℹ 1,479 more variables: CPSFAMID <chr>, FCSZ1231 <dbl+lbl>,
## #   FCRP1231 <dbl+lbl>, RULETR31 <chr>, RULETR42 <chr>, RULETR53 <chr>,
## #   RULETR21 <chr>, RUSIZE31 <dbl+lbl>, RUSIZE42 <dbl+lbl>, RUSIZE53 <dbl+lbl>,
## #   RUSIZE21 <dbl+lbl>, RUCLAS31 <dbl+lbl>, RUCLAS42 <dbl+lbl>,
## #   RUCLAS53 <dbl+lbl>, RUCLAS21 <dbl+lbl>, FAMSZE31 <dbl+lbl>,
## #   FAMSZE42 <dbl+lbl>, FAMSZE53 <dbl+lbl>, FAMSZE21 <dbl+lbl>,
## #   FMRS1231 <dbl+lbl>, FAMS1231 <dbl+lbl>, FAMSZEYR <dbl+lbl>, …

Our main explanatory variables:

ADSMOK42: currently smokes

SDENICPROD: used electronic nicotine product

The other variables we selected are:

ADNSMK42 - If ADSMOK42 = 1, doctor advised you to quit smoking

Other variables we selected can be grouped as follows:

Demographic variables: AGELAST, REGION21, SEX, RACEV2X, MARRY21X, FTSTU21X, TTLP21X, FAMINC21, EMPST31

Mental health: ADDPRS42, ADHOPE42, ADNERV42, ADPCFL42, ADSAD42, MNHLTH31

General health: ADGENH42, ADENGY42, RTHLTH31, WLKLIM53

Disease Diagnosis: ASTHDX, ARTHDX, CALUNG, CHOLDX, CHBRON31, DIABDX_M18, EMPHDX, HIBPDX, STRKDX

vars_to_keep = c("ADSMOK42","SDENICPROD","ADDPRS42","ADGENH42",
                 "ADENGY42","ADHOPE42","ADNERV42","ADPCFL42","ADSAD42",
                 "AGELAST","ASTHDX","CALUNG","CHOLDX","DIABDX_M18","EMPHDX",
                 "FAMINC21","HIBPDX","MNHLTH31","RTHLTH31","TTLP21X",
                 "WLKLIM53","SDCOMPAN","SDLEFTOUT","SDLIFE", 'REGION21', 'RACEV2X', 
                 'FTSTU21X', 'SEX','STRKDX','HIBPDX', 'MARRY21X','CHBRON31' ,'CANCERDX', 
                 'ARTHDX', 'ADNSMK42', 'SDHOME', 'SDMEDCARE','SDHLTHFOOD', 'SDISOL',
                 'SDHMDEPR','SDHMALC','SDHMDRG', 'SDHMDIV', 'SDHMBEAT', 'SDHURTCHLD', 'EMPST31')

small_df = h233 %>% select(all_of(vars_to_keep)) %>% as_factor()
#turning the variables into factors

small_df = small_df %>% mutate(age_group = cut(AGELAST,
                                    breaks = c(-Inf,18, 25, 35, 45, 65, Inf),
                  labels = c("0-18","18-24", "25-34", "35-44", "45-64", "65+"),
                  right = FALSE))
#adding a new variable to break into age groups

small_df %>% summary()
##                    ADSMOK42                      SDENICPROD   
##  -15 CANNOT BE COMPUTED:    0   -15 CANNOT BE COMPUTED:  216  
##  -8 DK                 :    0   -1 INAPPLICABLE       : 9936  
##  -7 REFUSED            :    0   1 YES                 : 2419  
##  -1 INAPPLICABLE       :12914   2 NO                  :15765  
##  1 YES                 : 2831                                 
##  2 NO                  :12591                                 
##                                                               
##                       ADDPRS42                       ADGENH42    
##  -15 CANNOT BE COMPUTED   :  225   -15 CANNOT BE COMPUTED:  482  
##  -1 INAPPLICABLE          :12914   -1 INAPPLICABLE       :12914  
##  0 NOT AT ALL             :11473   1 EXCELLENT           : 2461  
##  1 SEVERAL DAYS           : 2673   2 VERY GOOD           : 5284  
##  2 MORE THAN HALF THE DAYS:  665   3 GOOD                : 4888  
##  3 NEARLY EVERY DAY       :  386   4 FAIR                : 1953  
##                                    5 POOR                :  354  
##                      ADENGY42                       ADHOPE42    
##  -1 INAPPLICABLE         :12914   -15 CANNOT BE COMPUTED:  228  
##  2 MOST OF THE TIME      : 5565   -1 INAPPLICABLE       :12914  
##  3 A GOOD BIT OF THE TIME: 2923   0 NONE OF THE TIME    :11623  
##  4 SOME OF THE TIME      : 2634   1 LITTLE OF THE TIME  : 2009  
##  1 ALL OF THE TIME       : 1744   2 SOME OF THE TIME    : 1033  
##  5 LITTLE OF THE TIME    : 1361   3 MOST OF THE TIME    :  381  
##  (Other)                 : 1195   4 ALL OF THE TIME     :  148  
##                    ADNERV42                         ADPCFL42    
##  -15 CANNOT BE COMPUTED:  187   -1 INAPPLICABLE         :12914  
##  -1 INAPPLICABLE       :12914   2 MOST OF THE TIME      : 6314  
##  0 NONE OF THE TIME    : 7832   1 ALL OF THE TIME       : 2639  
##  1 LITTLE OF THE TIME  : 4507   3 A GOOD BIT OF THE TIME: 2636  
##  2 SOME OF THE TIME    : 2107   4 SOME OF THE TIME      : 2061  
##  3 MOST OF THE TIME    :  605   5 LITTLE OF THE TIME    :  885  
##  4 ALL OF THE TIME     :  184   (Other)                 :  887  
##                    ADSAD42         AGELAST                        ASTHDX     
##  -15 CANNOT BE COMPUTED:  219   Min.   : 0.0   -15 CANNOT BE COMPUTED:   94  
##  -1 INAPPLICABLE       :12914   1st Qu.:23.0   -8 DK                 :   20  
##  0 NONE OF THE TIME    :12089   Median :44.0   -7 REFUSED            :   23  
##  1 LITTLE OF THE TIME  : 1837   Mean   :43.2   -1 INAPPLICABLE       :   31  
##  2 SOME OF THE TIME    :  855   3rd Qu.:64.0   1 YES                 : 4022  
##  3 MOST OF THE TIME    :  299   Max.   :85.0   2 NO                  :24146  
##  4 ALL OF THE TIME     :  123                                                
##                     CALUNG                         CHOLDX     
##  -15 CANNOT BE COMPUTED:    0   -15 CANNOT BE COMPUTED:  186  
##  -8 DK                 :   11   -8 DK                 :   43  
##  -7 REFUSED            :    1   -7 REFUSED            :   23  
##  -1 INAPPLICABLE       :25428   -1 INAPPLICABLE       : 5553  
##  1 YES                 :   88   1 YES                 : 7925  
##  2 NO                  : 2808   2 NO                  :14606  
##                                                               
##                   DIABDX_M18                       EMPHDX         FAMINC21     
##  -15 CANNOT BE COMPUTED:    0   -15 CANNOT BE COMPUTED:  185   Min.   :-88503  
##  -8 DK                 :   22   -8 DK                 :   18   1st Qu.: 26872  
##  -7 REFUSED            :   23   -7 REFUSED            :   23   Median : 58000  
##  -1 INAPPLICABLE       :  125   -1 INAPPLICABLE       : 5553   Mean   : 81004  
##  1 YES                 : 3265   1 YES                 :  483   3rd Qu.:110172  
##  2 NO                  :24901   2 NO                  :22074   Max.   :663883  
##                                                                                
##                     HIBPDX                 MNHLTH31               RTHLTH31   
##  -15 CANNOT BE COMPUTED:  186   1 EXCELLENT    :8786   2 VERY GOOD    :8831  
##  -8 DK                 :   37   2 VERY GOOD    :8604   1 EXCELLENT    :7801  
##  -7 REFUSED            :   22   3 GOOD         :7595   3 GOOD         :7556  
##  -1 INAPPLICABLE       : 5553   4 FAIR         :2186   4 FAIR         :2824  
##  1 YES                 : 8538   -1 INAPPLICABLE: 573   5 POOR         : 681  
##  2 NO                  :14000   5 POOR         : 512   -1 INAPPLICABLE: 573  
##                                 (Other)        :  80   (Other)        :  70  
##     TTLP21X                  WLKLIM53                       SDCOMPAN   
##  Min.   :-88503   -8 DK          :    4   -1 INAPPLICABLE       :9936  
##  1st Qu.:     0   -7 REFUSED     :   29   1 NEVER               :7135  
##  Median : 20000   -1 INAPPLICABLE:13428   3 SOMETIMES           :4724  
##  Mean   : 34920   1 YES          : 1811   2 RARELY              :4596  
##  3rd Qu.: 49513   2 NO           :13064   4 OFTEN               :1816  
##  Max.   :390565                           -15 CANNOT BE COMPUTED: 124  
##                                           (Other)               :   5  
##                   SDLEFTOUT                       SDLIFE    
##  -1 INAPPLICABLE       :9936   -1 INAPPLICABLE       :9936  
##  1 NEVER               :6635   2 VERY SATISFIED      :8106  
##  2 RARELY              :5883   3 SOMEWHAT SATISFIED  :4892  
##  3 SOMETIMES           :4570   1 COMPLETELY SATISFIED:4007  
##  4 OFTEN               :1178   4 A LITTLE SATISFIED  : 881  
##  -15 CANNOT BE COMPUTED: 130   5 NOT AT ALL SATISFIED: 265  
##  (Other)               :   4   (Other)               : 249  
##             REGION21                                            RACEV2X     
##  -1 INAPPLICABLE:  233   1 WHITE - NO OTHER RACE REPORTED           :21262  
##  1 NORTHEAST    : 4403   2 BLACK - NO OTHER RACE REPORTED           : 4232  
##  2 MIDWEST      : 5434   12 MULTIPLE RACES REPORTED                 : 1059  
##  3 SOUTH        :10738   10 OTH ASIAN/NATV HAWAIIAN/PACFC ISL-NO OTH:  574  
##  4 WEST         : 7528   4 ASIAN INDIAN - NO OTHER RACE REPORTED    :  409  
##                          5 CHINESE - NO OTHER RACE REPORTED         :  304  
##                          (Other)                                    :  496  
##                    FTSTU21X                         SEX       
##  -15 CANNOT BE COMPUTED:    0   -15 CANNOT BE COMPUTED:    0  
##  -8 DK                 :    0   -8 DK                 :    0  
##  -7 REFUSED            :    3   -7 REFUSED            :    0  
##  -1 INAPPLICABLE       :26037   -1 INAPPLICABLE       :    0  
##  1 FULL-TIME STUDENT   : 1091   1 MALE                :13423  
##  2 PART-TIME STUDENT   :  137   2 FEMALE              :14913  
##  3 NOT A STUDENT       : 1068                                 
##                     STRKDX                               MARRY21X    
##  -15 CANNOT BE COMPUTED:  185   1 MARRIED                    :10526  
##  -8 DK                 :   19   5 NEVER MARRIED              : 7087  
##  -7 REFUSED            :   23   6 UNDER AGE 16 - INAPPLICABLE: 4844  
##  -1 INAPPLICABLE       : 5553   3 DIVORCED                   : 3261  
##  1 YES                 : 1143   2 WIDOWED                    : 2058  
##  2 NO                  :21413   4 SEPARATED                  :  548  
##                                 (Other)                      :   12  
##                    CHBRON31                       CANCERDX    
##  -15 CANNOT BE COMPUTED:    2   -15 CANNOT BE COMPUTED:  189  
##  -8 DK                 :   67   -8 DK                 :   17  
##  -7 REFUSED            :   54   -7 REFUSED            :   24  
##  -1 INAPPLICABLE       : 6144   -1 INAPPLICABLE       : 5553  
##  1 YES                 :  337   1 YES                 : 2908  
##  2 NO                  :21732   2 NO                  :19645  
##                                                               
##                     ARTHDX                                       ADNSMK42    
##  -15 CANNOT BE COMPUTED:  185   -15 CANNOT BE COMPUTED               :   38  
##  -8 DK                 :   27   -8 DK                                :    0  
##  -7 REFUSED            :   24   -7 REFUSED                           :    0  
##  -1 INAPPLICABLE       : 5553   -1 INAPPLICABLE                      :25505  
##  1 YES                 : 6666   1 YES                                : 1104  
##  2 NO                  :15881   2 NO                                 : 1265  
##                                 3 HAD NO VISITS IN THE LAST 12 MONTHS:  424  
##                     SDHOME               SDMEDCARE              SDHLTHFOOD  
##  -1 INAPPLICABLE       :9936   -1 INAPPLICABLE:9936   -1 INAPPLICABLE:9936  
##  2 VERY SATISFIED      :7330   2 VERY GOOD    :6245   2 VERY GOOD    :5954  
##  1 COMPLETELY SATISFIED:5970   1 EXCELLENT    :6152   1 EXCELLENT    :5526  
##  3 SOMEWHAT SATISFIED  :3598   3 GOOD         :3938   3 GOOD         :4159  
##  4 A LITTLE SATISFIED  : 814   4 FAIR         :1364   4 FAIR         :1711  
##  5 NOT AT ALL SATISFIED: 444   5 POOR         : 439   5 POOR         : 707  
##  (Other)               : 244   (Other)        : 262   (Other)        : 343  
##                     SDISOL                       SDHMDEPR    
##  -1 INAPPLICABLE       :9936   -15 CANNOT BE COMPUTED:  226  
##  1 NEVER               :7206   -1 INAPPLICABLE       : 9936  
##  2 RARELY              :4987   1 YES                 : 4119  
##  3 SOMETIMES           :4415   2 NO                  :14055  
##  4 OFTEN               :1660                                 
##  -15 CANNOT BE COMPUTED: 131                                 
##  (Other)               :   1                                 
##                    SDHMALC                        SDHMDRG     
##  -15 CANNOT BE COMPUTED:  226   -15 CANNOT BE COMPUTED:  221  
##  -1 INAPPLICABLE       : 9936   -1 INAPPLICABLE       : 9936  
##  1 YES                 : 4472   1 YES                 : 1957  
##  2 NO                  :13702   2 NO                  :16222  
##                                                               
##                                                               
##                                                               
##                    SDHMDIV                          SDHMBEAT    
##  -15 CANNOT BE COMPUTED:  242   -15 CANNOT BE COMPUTED  :  257  
##  -1 INAPPLICABLE       : 9936   -1 INAPPLICABLE         : 9936  
##  1 YES                 : 4437   1 NEVER                 :14647  
##  2 NO                  :12550   2 ONCE                  : 1208  
##  8 PARENTS NOT MARRIED : 1171   3 MORE THAN ONCE        : 2286  
##                                 4 NEVER OR ONCE         :    0  
##                                 5 ONCE OR MORE THAN ONCE:    2  
##                     SDHURTCHLD                                     EMPST31     
##  -15 CANNOT BE COMPUTED  :  246   1 EMPLOYED AT RD 3/1 INT DATE        :11989  
##  -1 INAPPLICABLE         : 9936   4 NOT EMPLOYED DURING RD 3/1         :10186  
##  1 NEVER                 :14042   -1 INAPPLICABLE                      : 5456  
##  2 ONCE                  : 1415   3 JOB DURING RD 3/1 REF PERIOD       :  267  
##  3 MORE THAN ONCE        : 2695   -7 REFUSED                           :  159  
##  4 NEVER OR ONCE         :    2   2 JOB TO RETURN TO AT RD 3/1 INT DATE:  132  
##  5 ONCE OR MORE THAN ONCE:    0   (Other)                              :  147  
##  age_group   
##  0-18 :5557  
##  18-24:1988  
##  25-34:3306  
##  35-44:3479  
##  45-64:7299  
##  65+  :6707  
## 

2 Descriptive statistics

2.1 Demographic characteristics

We start by looking at the age distribution of people that took the survey

#smokers by age group
small_df %>% select(ADSMOK42, age_group) %>% group_by(age_group) %>%
  summarise(n_smokers = sum(ADSMOK42 == "1 YES"),
            n_non_smokers = sum(ADSMOK42 == "2 NO"),
            n_inapplicable = sum(ADSMOK42 == "-1 INAPPLICABLE")) %>%
  mutate(perc_smokers = 100*n_smokers/n_non_smokers) %>% 
  slice(-1) %>%
  ggplot(aes(x = age_group)) +
  geom_col(aes(y = n_non_smokers, fill = "non smokers")) +
  geom_col(aes(y = n_smokers, fill = "smokers")) +
  geom_text(aes(label=paste(round(perc_smokers,2),"%"), y = 130)) +
  labs(title = "Age distribution of smokers and non smokers",
       y = "count", x = "age group", fill = "") + theme_classic()

# age distribution of users of electronic cigarettes 
small_df %>% select(SDENICPROD, age_group) %>% group_by(age_group) %>%
  summarise(n_smokers = sum(SDENICPROD == "1 YES"),
            n_non_smokers = sum(SDENICPROD == "2 NO"),
            n_inapplicable = sum(SDENICPROD == "-1 INAPPLICABLE")) %>%
  mutate(perc_smokers = 100*n_smokers/n_non_smokers) %>% 
  slice(-1) %>%
  ggplot(aes(x = age_group)) +
  geom_col(aes(y = n_non_smokers, fill = "non users")) +
  geom_col(aes(y = n_smokers, fill = "users")) +
  geom_text(aes(label=paste(round(perc_smokers,2),"%"), y = 130)) +
  labs(title = "Age distribution of electronic nicotine product usage",
       y = "count", x = "age group", fill = "") + theme_classic()

Smoking rate increases with age, till the 45-64 age group. They are the most likely to smoke. However the last age group, 65 and over is the one with the lowest rate. This fact is key to understand the effect of age: as is well known, general health decreases sharply with old age, so it will be important in our analysis to always control for the confounding effect of age.

Looking at electronic nicotine usage, we see a very different picture: in the youngest group there is a really high rate, 41%. Almost one every two young adults uses those products. The rate is decreasing with age: every older generation show a large decrease from the younger one. The 65+ group, which as we have already said is the one with the worst health, sees almost no usage: just over 5%.

The two variables will allow us to make interesting comparisons since the makeup of the groups is very different.

small_df <- small_df %>%
  mutate(
    sex_group = case_when(
      SEX == '1 MALE' ~ 'MALE',
      SEX == '2 FEMALE' ~ 'FEMALE',
      TRUE ~ NA_character_
    )
  )
smokers_df <- small_df %>% filter(ADSMOK42 == "1 YES")

non_smokers_df <- small_df %>% filter(ADSMOK42 == "2 NO")

e_smokers_df <-small_df %>% filter(SDENICPROD == "1 YES")

2.1.1 Difference between smokers and non-smokers

create_plots_demogr <- function(data, demographic_variables, titles) {
  data <- data %>%
    mutate(
      tot_inc_group = cut(TTLP21X,
                          breaks = c(-Inf, 10000, 15000, 20000, 25000, 30000, Inf),
                          labels = c("0-10000", "10001-15000", "15001-20000", "20001-25000", "25001-30000", "30000+"),
                          right = FALSE),
      fam_inc_group = cut(FAMINC21,
                          breaks = c(-Inf, 10000, 15000, 20000, 25000, 30000, Inf),
                          labels = c("0-10000", "10001-15000", "15001-20000", "20001-25000", "25001-30000", "30000+"),
                          right = FALSE)
    ) %>%
    filter(!is.na(tot_inc_group) & !is.na(fam_inc_group) & !is.na(sex_group)) %>%
    filter(tot_inc_group != "NA")
  
  create_plots <- function(data, variable, title) {
    my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
    barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
  }
  
  par(mfrow = c(1, 3)) 
  for (variable in demographic_variables) {
    create_plots(data, variable, titles[[variable]])
  }
}

demographic_variables <- c("age_group", "REGION21", "sex_group", "RACEV2X", "MARRY21X", "FTSTU21X", "tot_inc_group", "fam_inc_group", "EMPST31")

titles_smokers <- c(
  age_group = "Age Group Distribution (Smokers)",
  REGION21 = "Region Distribution (Smokers)",
  sex_group = "Gender Distribution (Smokers)",
  RACEV2X = "Race Distribution (Smokers)",
  MARRY21X = "Marital Status Distribution (Smokers)",
  FTSTU21X = "Full-Time Student Distribution (Smokers)",
  tot_inc_group = "Total Income Distribution (Smokers)",
  fam_inc_group = "Family Income Distribution (Smokers)",
  EMPST31 = "Employment Status Distribution (Smokers)"
)

titles_non_smokers <- c(
  age_group = "Age Group Distribution (Non-Smokers)",
  REGION21 = "Region Distribution (Non-Smokers)",
  sex_group = "Gender Distribution (Non-Smokers)",
  RACEV2X = "Race Distribution (Non-Smokers)",
  MARRY21X = "Marital Status Distribution (Non-Smokers)",
  FTSTU21X = "Full-Time Student Distribution (Non-Smokers)",
  tot_inc_group = "Total Income Distribution (Non-Smokers)",
  fam_inc_group = "Family Income Distribution (Non-Smokers)",
  EMPST31 = "Employment Status Distribution (Non-Smokers)"
)


create_plots_demogr(smokers_df, demographic_variables, titles_smokers)

create_plots_demogr(non_smokers_df, demographic_variables, titles_non_smokers)

Highlights A collective image of smokers: * 40% are 45-64 y.o. * 44% living in South region, ~22% in Midwest and ~22% in West * 49% of male and 51% of female * 70% of white race and 22% of black * 32% never married, 31% married, just 21% divorced and 10% widowed * 4% not a students, 1% full-time student, 0,31% part-time students * Mean of total income is 30586, ~51% earn 30000+, ~25% earn less than 10000 * Mean of total income of family is 50665, 55% earn 30000+, other family’s income distributed gradually in income range * 50% are not employed during RD, 46% employed

A collective image of non-smokers: * 36% are 65+ y.o., 31% are 45-64 y.o. * 37% living in South region, 22% in Midwest and ~26% in West * 56% of male and 44% of female * 79% of white race and 11% of black * 24% never married, 50% married, just 14% divorced and 10% widowed * 4% not a students, 2,6 % full-time student, 0,04% part-time students * Mean of total income is 47740, ~55% earn 30000+ * Mean of total income of family is 85484 , 44% earn 30000+,other family’s income distributed gradually in income range * 45% are not employed during RD, 53% employed

Did smokers receive a doctor’s recommendation to quit?

ggplot(smokers_df, aes(x = ADNSMK42)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Distribution of Smokers by Doctor's Recommendation", x = "Doctor's Recommendation", y = "Frequency") + theme_classic()

38% of current smokers still smokes after doctor’s advice to quit smoke

Smokers that also make use of electronic nicotine products

ggplot(smokers_df, aes(x = SDENICPROD == "1 YES")) +
  geom_bar(fill = "lightcoral") +
  labs(title = "Distribution of e-Smokers through current smokers", x = "E-smokers No/Yes", y = "Count") + theme_classic()

Current smokers make use of e-cigarettes at only 21% rate

This is the main reason why we looked at the two explanatory variables separately: the group that uses both is really small with only 700 people in a sample of 28,000 people

2.2 Mental Health Variables

#mental health
create_plots_mental_health <- function(data, mental_health_variables, titles) {
  data <- data %>%
    filter(!is.na(sex_group))
  
 create_plots <- function(data, variable, title) {
    my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
    barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
  }
  
  par(mfrow = c(1, 3)) 
  for (variable in mental_health_variables) {
    create_plots(data, variable, titles[[variable]])
  }
}

mental_health_variables <- c("ADDPRS42", "ADHOPE42", "ADNERV42", "ADPCFL42", "ADSAD42", "MNHLTH31")


titles_smokers_mental_health <- c(
  ADDPRS42 = "Perceived Stress (Smokers)",
  ADHOPE42 = "Hopefulness (Smokers)",
  ADNERV42 = "Nervousness (Smokers)",
  ADPCFL42 = "Psychological Fatigue (Smokers)",
  ADSAD42 = "Sadness (Smokers)",
  MNHLTH31 = "Mental Health Status (Smokers)"
)

titles_non_smokers_mental_health <- c(
  ADDPRS42 = "Perceived Stress (Non-Smokers)",
  ADHOPE42 = "Hopefulness (Non-Smokers)",
  ADNERV42 = "Nervousness (Non-Smokers)",
  ADPCFL42 = "Psychological Fatigue (Non-Smokers)",
  ADSAD42 = "Sadness (Non-Smokers)",
  MNHLTH31 = "Mental Health Status (Non-Smokers)"
)

titles_e_smokers_mental_health <- c(
  ADDPRS42 = "Perceived Stress (E-Smokers)",
  ADHOPE42 = "Hopefulness (E-Smokers)",
  ADNERV42 = "Nervousness (E-Smokers)",
  ADPCFL42 = "Psychological Fatigue (E-Smokers)",
  ADSAD42 = "Sadness (E-Smokers)",
  MNHLTH31 = "Mental Health Status (E-Smokers)"
)
create_plots_mental_health(smokers_df, mental_health_variables, titles_smokers_mental_health)

create_plots_mental_health(non_smokers_df, mental_health_variables, titles_non_smokers_mental_health)

create_plots_mental_health(e_smokers_df, mental_health_variables, titles_e_smokers_mental_health)

Highlights

  • Mostly smokers don’t perceived stress, hopefulness and sadness, nervousness mostly also no, but 25% little of the time. 36% of smokers little of the time have psychological fatigue. 35% smokers have a goog mental health, 24% very good mental health, 21% excellent and just 14% fair.

*In comparison, non-smokers mostly don’t feel perceived stress, hopefulness and sadness. However, that mental health status is better: 40% have very good, 28% excellent and 28% good, just 7% fair.

*Another picture for e-smokers: mostly of them don’t have perceived stress and hopefulness, but 24% fill little of time nervous and 15% some of the time. 27% little of the time have psychological fatigue, ~16% have it most of the time. Overall, 14% feel fair, mental health is more than less good or excellent.

2.3 General Health Variables

create_plots_general_health <- function(data, general_health_variables, titles) {
  data <- data %>%
    filter(!is.na(sex_group))
  
create_plots <- function(data, variable, title) {
    my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
    barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
  }
  
  par(mfrow = c(1, 2)) 
  for (variable in general_health_variables) {
    create_plots(data, variable, titles[[variable]])
  }
}

general_health_variables <- c("ADGENH42", "ADENGY42", "RTHLTH31", "WLKLIM53")


titles_smokers_general_health <- c(
  ADGENH42 = "General Health Perception (Smokers)",
  ADENGY42 = "Energy Level (Smokers)",
  RTHLTH31 = "Overall Health Rating (Smokers)",
  WLKLIM53 = "Walking Limitation (Smokers)"
)

titles_non_smokers_general_health <- c(
  ADGENH42 = "General Health Perception (Non-Smokers)",
  ADENGY42 = "Energy Level (Non-Smokers)",
  RTHLTH31 = "Overall Health Rating (Non-Smokers)",
  WLKLIM53 = "Walking Limitation (Non-Smokers)"
)

titles_e_smokers_general_health <- c(
  ADGENH42 = "General Health Perception (E-Smokers)",
  ADENGY42 = "Energy Level (E-Smokers)",
  RTHLTH31 = "Overall Health Rating (E-Smokers)",
  WLKLIM53 = "Walking Limitation (E-Smokers)"
)


create_plots_general_health(smokers_df, general_health_variables, titles_smokers_general_health)

create_plots_general_health(non_smokers_df, general_health_variables, titles_non_smokers_general_health)

create_plots_general_health(e_smokers_df, general_health_variables, titles_e_smokers_general_health)

Highlights:

  • General health perception of smokers and non-smokers quite similar (good or very-good mostly).Same for energy level and overall health, 38% of them don’t have a walking limitations.

*Little another picture for e-smokers.Their energy level some times lower and distribution on overall health rate is higher tan for smokers and non-smokers. 45% don’t have a walking limitations

2.4 Diseases Diagnosis

#deseases
create_plots_health_disease <- function(data, health_disease_variables, titles) {
  data <- data %>%
    filter(!is.na(sex_group))
  
create_plots <- function(data, variable, title) {
    my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
    vector_to_plot = data %>% filter(.data[[variable]] %in% c("1 YES", "2 NO")) %>%
      pull(variable) %>% factor()
    
    barplot(table(vector_to_plot), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
  }
  
  par(mfrow = c(1, 3)) 
  for (variable in health_disease_variables) {
    create_plots(data, variable, titles[[variable]])
  }
}

health_disease_variables <- c("ASTHDX", "ARTHDX", "CALUNG", "CHOLDX", "CHBRON31", "DIABDX_M18", "EMPHDX", "HIBPDX", "STRKDX")

titles_smokers_health_disease <- c(
  ASTHDX = "Asthma Diagnosis (Smokers)",
  ARTHDX = "Arthritis Diagnosis (Smokers)",
  CALUNG = "Lung Cancer Diagnosis (Smokers)",
  CHOLDX = "Cholesterol Diagnosis (Smokers)",
  CHBRON31 = "Chronic Bronchitis Diagnosis (Smokers)",
  DIABDX_M18 = "Diabetes Diagnosis (Smokers)",
  EMPHDX = "Emphysema Diagnosis (Smokers)",
  HIBPDX = "High Blood Pressure Diagnosis (Smokers)",
  OHRTDX = "Heart Disease Diagnosis (Smokers)",
  OHRTTYPE = "Type of Heart Disease Diagnosis (Smokers)",
  STRKDX = "Stroke Diagnosis (Smokers)"
)

titles_non_smokers_health_disease <- c(
  ASTHDX = "Asthma Diagnosis (Non-Smokers)",
  ARTHDX = "Arthritis Diagnosis (Non-Smokers)",
  CALUNG = "Lung Cancer Diagnosis (Non-Smokers)",
  CHOLDX = "Cholesterol Diagnosis (Non-Smokers)",
  CHBRON31 = "Chronic Bronchitis Diagnosis (Non-Smokers)",
  DIABDX_M18 = "Diabetes Diagnosis (Non-Smokers)",
  EMPHDX = "Emphysema Diagnosis (Non-Smokers)",
  HIBPDX = "High Blood Pressure Diagnosis (Non-Smokers)",
  OHRTDX = "Heart Disease Diagnosis (Non-Smokers)",
  OHRTTYPE = "Type of Heart Disease Diagnosis (Non-Smokers)",
  STRKDX = "Stroke Diagnosis (Non-Smokers)"
)

create_plots_health_disease(smokers_df, health_disease_variables, titles_smokers_health_disease)

create_plots_health_disease(non_smokers_df, health_disease_variables, titles_non_smokers_health_disease)

We immediately see that some diseases are very rare, while others are quite common.

condition_labels <- c(
  "Asthma Diagnosis",
  "Arthritis Diagnosis",
  "Lung Cancer Diagnosis",
  "Cholesterol Diagnosis",
  "Chronic Bronchitis Diagnosis",
  "Diabetes Diagnosis",
  "Emphysema Diagnosis",
  "High Blood Pressure Diagnosis",
  "Stroke Diagnosis"
)



diseases_percentages = small_df %>% filter(AGELAST >= 18) %>%
  select(ADSMOK42, all_of(health_disease_variables)) %>% group_by(ADSMOK42) %>%
  summarise(ASTHDX = sum(ASTHDX == "1 YES")/sum(ASTHDX %in% c("1 YES","2 NO")),
            ARTHDX = sum(ARTHDX == "1 YES")/sum(ARTHDX %in% c("1 YES","2 NO")),
            CALUNG = sum(CALUNG == "1 YES")/sum(CALUNG %in% c("1 YES","2 NO")),
            CHOLDX = sum(CHOLDX == "1 YES")/sum(CHOLDX %in% c("1 YES","2 NO")),
            CHBRON31 = sum(CHBRON31 == "1 YES")/sum(CHBRON31 %in% c("1 YES","2 NO")),
            DIABDX_M18 = sum(DIABDX_M18 == "1 YES")/sum(DIABDX_M18 %in% c("1 YES","2 NO")),
            EMPHDX = sum(EMPHDX == "1 YES")/sum(EMPHDX %in% c("1 YES","2 NO")),
            HIBPDX = sum(HIBPDX == "1 YES")/sum(HIBPDX %in% c("1 YES","2 NO")),
            STRKDX = sum(STRKDX == "1 YES")/sum(STRKDX %in% c("1 YES","2 NO"))) %>%
  pivot_longer(cols = health_disease_variables) %>%
  rename(percentage = value , condition = name) %>%
  mutate(condition = factor(condition,
                            levels = health_disease_variables,
                            labels = condition_labels))
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(health_disease_variables)
## 
##   # Now:
##   data %>% select(all_of(health_disease_variables))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
diseases_percentages %>% filter(ADSMOK42 == "1 YES") %>%
  ggplot(aes(x = condition, y = percentage*100, fill = condition)) +
    geom_col() +
    coord_flip() +
    labs(
      title = "Percentage of Smokers 18 or Older \nWho Report Specific Health Diseases",
      x = "",
      y = "Percentage"
    ) +
    theme_classic() +
    theme(legend.position = "none")

diseases_percentages %>% filter(ADSMOK42 == "2 NO") %>%
  ggplot(aes(x = condition, y = percentage*100, fill = condition)) +
    geom_col() +
    coord_flip() +
    labs(
      title = "Percentage of non-Smokers 18 or Older \nWho Report Specific Health Diseases",
      x = "",
      y = "Percentage"
    ) +
    theme_classic() +
    theme(legend.position = "none")

diseases_percentages %>% filter(ADSMOK42 != "-1 INAPPLICABLE") %>%
  mutate(ADSMOK42 = factor(ADSMOK42, levels = c("1 YES","2 NO"), 
                           labels = c("Smokers","Non-Smokers"))) %>%
  mutate(ADSMOK42 = relevel(ADSMOK42, "Non-Smokers")) %>%
  ggplot(aes(x = condition, y = percentage*100, fill = ADSMOK42)) +
    geom_col(position = position_dodge()) +
    coord_flip() +
    labs(
      title = "Percentage of people 18 or Older Who Report Specific Diseases",
      x = "",
      y = "Percentage"
    ) +
  theme(legend.title = element_blank())

# recreate the plot above but with electronic nicotine

diseases_percentages = small_df %>% filter(AGELAST >= 18) %>%
  select(SDENICPROD, all_of(health_disease_variables)) %>% group_by(SDENICPROD) %>%
  summarise(ASTHDX = sum(ASTHDX == "1 YES")/sum(ASTHDX %in% c("1 YES","2 NO")),
            ARTHDX = sum(ARTHDX == "1 YES")/sum(ARTHDX %in% c("1 YES","2 NO")),
            CALUNG = sum(CALUNG == "1 YES")/sum(CALUNG %in% c("1 YES","2 NO")),
            CHOLDX = sum(CHOLDX == "1 YES")/sum(CHOLDX %in% c("1 YES","2 NO")),
            CHBRON31 = sum(CHBRON31 == "1 YES")/sum(CHBRON31 %in% c("1 YES","2 NO")),
            DIABDX_M18 = sum(DIABDX_M18 == "1 YES")/sum(DIABDX_M18 %in% c("1 YES","2 NO")),
            EMPHDX = sum(EMPHDX == "1 YES")/sum(EMPHDX %in% c("1 YES","2 NO")),
            HIBPDX = sum(HIBPDX == "1 YES")/sum(HIBPDX %in% c("1 YES","2 NO")),
            STRKDX = sum(STRKDX == "1 YES")/sum(STRKDX %in% c("1 YES","2 NO"))) %>%
  pivot_longer(cols = health_disease_variables) %>%
  rename(percentage = value , condition = name) %>%
  mutate(condition = factor(condition,
                            levels = health_disease_variables,
                            labels = condition_labels))



diseases_percentages %>% filter(SDENICPROD %in% c("1 YES","2 NO")) %>%
  mutate(SDENICPROD = factor(SDENICPROD, levels = c("1 YES","2 NO"), 
                           labels = c("Users of E-nic. Prods","Non-Users"))) %>%
  mutate(SDENICPROD = relevel(SDENICPROD, "Non-Users")) %>%
  ggplot(aes(x = condition, y = percentage*100, fill = SDENICPROD)) +
    geom_col(position = position_dodge()) +
    coord_flip() +
    labs(
      title = "Disease Rates among Users of Electronic Nicotine Products",
      x = "",
      y = "Percentage"
    ) +
  theme(legend.title = element_blank())

3 Contingency Tables

3.1 Smoking

3.1.1 Smoking and Diseases

# contingency tables and chisquared test for
# ADSMOK42


#starting with ADSMOK42 and physical health

titles_health_disease <- c(
  ASTHDX = "Asthma Diagnosis",
  ARTHDX = "Arthritis Diagnosis",
  CALUNG = "Lung Cancer Diagnosis",
  CHOLDX = "Cholesterol Diagnosis",
  CHBRON31 = "Chronic Bronchitis Diagnosis",
  DIABDX_M18 = "Diabetes Diagnosis",
  EMPHDX = "Emphysema Diagnosis",
  HIBPDX = "High Blood Pressure Diagnosis",
  STRKDX = "Stroke Diagnosis"
)

for(variable in health_disease_variables){
  if("1 YES" %in% levels(small_df[[variable]])){
  print(titles_health_disease[variable])
  print("Contingency table")
  print(cont_tables(small_df, "ADSMOK42", variable)%>% round(digits=3))
  print("testing hypothesis of no association")
  print(cont_tables(small_df, "ADSMOK42", variable,F)%>% chisq.test(correct = F))
  print("------------")
  }

}
##             ASTHDX 
## "Asthma Diagnosis" 
## [1] "Contingency table"
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(explanatory_var)
## 
##   # Now:
##   data %>% select(all_of(explanatory_var))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(var_of_interest)
## 
##   # Now:
##   data %>% select(all_of(var_of_interest))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
##         ASTHDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.176 0.824
##    2 NO  0.146 0.854
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 15.62, df = 1, p-value = 7.743e-05
## 
## [1] "------------"
##                ARTHDX 
## "Arthritis Diagnosis" 
## [1] "Contingency table"
##         ARTHDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.355 0.645
##    2 NO  0.329 0.671
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 6.784, df = 1, p-value = 0.009198
## 
## [1] "------------"
##                  CALUNG 
## "Lung Cancer Diagnosis" 
## [1] "Contingency table"
##         CALUNG
## ADSMOK42 1 YES  2 NO
##    1 YES 0.044 0.956
##    2 NO  0.025 0.975
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 3.6873, df = 1, p-value = 0.05483
## 
## [1] "------------"
##                  CHOLDX 
## "Cholesterol Diagnosis" 
## [1] "Contingency table"
##         CHOLDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.406 0.594
##    2 NO  0.389 0.611
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 2.6921, df = 1, p-value = 0.1008
## 
## [1] "------------"
##                       CHBRON31 
## "Chronic Bronchitis Diagnosis" 
## [1] "Contingency table"
##         CHBRON31
## ADSMOK42 1 YES  2 NO
##    1 YES 0.027 0.973
##    2 NO  0.014 0.986
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 27.302, df = 1, p-value = 1.74e-07
## 
## [1] "------------"
##           DIABDX_M18 
## "Diabetes Diagnosis" 
## [1] "Contingency table"
##         DIABDX_M18
## ADSMOK42 1 YES  2 NO
##    1 YES 0.167 0.833
##    2 NO  0.151 0.849
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 4.4548, df = 1, p-value = 0.0348
## 
## [1] "------------"
##                EMPHDX 
## "Emphysema Diagnosis" 
## [1] "Contingency table"
##         EMPHDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.053 0.947
##    2 NO  0.017 0.983
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 135.88, df = 1, p-value < 2.2e-16
## 
## [1] "------------"
##                          HIBPDX 
## "High Blood Pressure Diagnosis" 
## [1] "Contingency table"
##         HIBPDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.446 0.554
##    2 NO  0.408 0.592
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 13.933, df = 1, p-value = 0.0001894
## 
## [1] "------------"
##             STRKDX 
## "Stroke Diagnosis" 
## [1] "Contingency table"
##         STRKDX
## ADSMOK42 1 YES  2 NO
##    1 YES 0.076 0.924
##    2 NO  0.050 0.950
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 29.652, df = 1, p-value = 5.169e-08
## 
## [1] "------------"

Summary of findings from contingency tables:

Asthma diagnosis: the association is statistically significant

Arthritis: also statistically significant

Lung cancer: p-value is just above 5%, the significance is weaker

Cholesterol Diagnosis: Smokers do have higher rates than non-smokers, but it’s not statistically significant

Chronic Bronchitis: the association is statistically significant

Diabetes diagnosis: also statistically significant

Emphysema diagnosis: also statistically significant

High blood pressure: also statistically significant

Stroke diagnosis: also statistically significant

All the findings are promising: the contingency tables show strong association between smoking and various diseases. However this is not enough to draw conclusions: we need to control for other variables. We know that the percentage of smokers is different for different age brackets, now we will repeat the test but only selecting one age bracket at the time

chisq_df_diseases = data.frame(diseases = health_disease_variables,
                               p_val_full_data= rep(NA, length(health_disease_variables)),
                               p_val_65plus_data= rep(NA, length(health_disease_variables)),
                               p_val_45_64_data= rep(NA, length(health_disease_variables)))

for(i in 1:nrow(chisq_df_diseases)){
  variable = chisq_df_diseases[i,"diseases"]
  test_full = cont_tables(small_df,"ADSMOK42",variable, F) %>%
    chisq.test(correct = F)
  test_45_64 = cont_tables_age(small_df,"ADSMOK42",variable,"45-64",F) %>%
    chisq.test(correct = F)
  test_65plus = cont_tables_age(small_df,"ADSMOK42",variable,"65+",F) %>%
    chisq.test(correct = F)
  chisq_df_diseases[i,"p_val_full_data"] = test_full$p.val
  chisq_df_diseases[i,"p_val_45_64_data"] = test_45_64$p.val
  chisq_df_diseases[i,"p_val_65plus_data"] = test_65plus$p.val
}
## Warning in chisq.test(., correct = F): L'approssimazione al Chi-quadrato
## potrebbe essere inesatta
chisq_df_diseases = chisq_df_diseases %>%
  mutate(diseases = titles_health_disease[diseases]) %>%
  pivot_longer(cols=c("p_val_full_data","p_val_65plus_data","p_val_45_64_data"))

ggplot(chisq_df_diseases, aes(x = name, y = value)) +
  geom_col() + geom_hline(aes(yintercept = 0.05), col = "red") +
  facet_wrap(~diseases, scales = "free") + 
  labs(title="Testing hypothesis of no association between Smoking and various diseases",
       subtitle = "P-value of Chi squared test, controlling for age groups",
       x = "", y ="")+
  scale_x_discrete(labels = c("p_val_full_data" = "All patients",
                              "p_val_65plus_data" = "65 plus",
                              "p_val_45_64_data" = "45-64")) +
  theme_minimal() + theme(axis.text.x = element_text(size = 7))

We can also look at the odds ratio.

or_data_all = expand.grid(diseases = health_disease_variables,
                      age = "all",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_all)){
  var_of_interest = as.character(or_data_all$diseases[i])
  or = small_df %>%
    select(ADSMOK42,var_of_interest)%>%
    filter(ADSMOK42 %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_all[i,"estimate"] = or$measure[2,"estimate"]
  or_data_all[i,"lower"] = or$measure[2,"lower"]
  or_data_all[i,"upper"] = or$measure[2,"upper"]
}

or_data_45_64 = expand.grid(diseases = health_disease_variables,
                      age = "45_64",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_45_64)){
  var_of_interest = as.character(or_data_45_64$diseases[i])
  or = small_df %>% filter(age_group == "45-64") %>%
    select(ADSMOK42,var_of_interest)%>%
    filter(ADSMOK42 %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_45_64[i,"estimate"] = or$measure[2,"estimate"]
  or_data_45_64[i,"lower"] = or$measure[2,"lower"]
  or_data_45_64[i,"upper"] = or$measure[2,"upper"]
}
## Warning in chisq.test(xx, correct = correction): L'approssimazione al
## Chi-quadrato potrebbe essere inesatta
or_data_65plus = expand.grid(diseases = health_disease_variables,
                      age = "65plus",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_65plus)){
  var_of_interest = as.character(or_data_65plus$diseases[i])
  or = small_df %>% filter(age_group == "65+") %>%
    select(ADSMOK42,var_of_interest)%>%
    filter(ADSMOK42 %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_65plus[i,"estimate"] = or$measure[2,"estimate"]
  or_data_65plus[i,"lower"] = or$measure[2,"lower"]
  or_data_65plus[i,"upper"] = or$measure[2,"upper"]
}

or_data = rbind(or_data_all,or_data_45_64,or_data_65plus)

or_data %>% mutate(diseases = titles_health_disease[diseases]) %>%
  ggplot(aes(x = age)) + geom_point(aes(y = estimate)) +
  geom_segment(aes(x = age, xend = age, y = lower, yend = upper))+
  geom_hline(aes(yintercept  = 1), col = "red")+
  facet_wrap(~diseases, scales = "free") +
  labs(title="Odds ratio of disease diagnosis for Smokers and Non-Smokers",
       subtitle = "95% confidence intervals for different age groups",
       x = "", y ="")+
  scale_x_discrete(labels = c("all" = "All",
                              "65plus" = "65 and over" ,
                              "45_64" = "45-64")) +
  theme_minimal() + theme(axis.text.x = element_text(size = 7))

3.1.2 Smoking and Mental Health

#writing functions to make contingency tables for multiple categories

table_multi_cat = function(data, explanatory_var, var_of_interest, cond_prob = T){
  tbl = data %>%
    select(explanatory_var, var_of_interest)%>%
    filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
           !.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
                                            "-1 INAPPLICABLE",
                                            "-8 DK","-7 REFUSED"))%>%
    mutate(across(c(1,2), factor)) %>%
    table()
  if (cond_prob){
    tbl = tbl %>% prop.table(margin = 1)
  }
  tbl
}

#this one takes two variables and an age group, allowing to check separately
#different ages
table_multi_cat_age = function(data, explanatory_var, var_of_interest,
                           age_bracket, cond_prob = T){
  tbl = data %>% filter(age_group == age_bracket) %>%
    select(explanatory_var,var_of_interest) %>%
    filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
           !.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
                                            "-1 INAPPLICABLE",
                                            "-8 DK","-7 REFUSED"))%>%
    mutate(across(c(1,2), factor)) %>%
    table()
  
  if(cond_prob){
    tbl = tbl %>% prop.table(margin = 1)
    }
  tbl
}

#plotting all the mental health variables

# "MNHLTH31"
table_multi_cat(small_df, "ADSMOK42", "MNHLTH31") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = MNHLTH31)) +
  labs(title = "Contingency table: Perceived Mental Health status", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

#"ADPCFL42"
table_multi_cat(small_df, "ADSMOK42", "ADPCFL42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADPCFL42)) +
  labs(subtitle = "During the past 4 weeks, how often people felt calm and peaceful",
       title = "Contingency table: Smoking and feeling peaceful", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

#"ADDPRS42"
table_multi_cat(small_df, "ADSMOK42", "ADDPRS42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADDPRS42)) +
  labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
       title = "Contingency table: Smoking and feeling down", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

# "ADHOPE42"
table_multi_cat(small_df, "ADSMOK42", "ADHOPE42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADHOPE42)) +
  labs(subtitle = "During the past 30 days, how often people felt hopeless",
       title = "Contingency table: Smoking and feeling hopeless", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

# "ADNERV42"
table_multi_cat(small_df, "ADSMOK42", "ADNERV42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADNERV42)) +
  labs(subtitle = "During the past 30 days, how often people felt nervous",
       title = "Contingency table: Smoking and feeling nervous", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

# "ADSAD42" 
table_multi_cat(small_df, "ADSMOK42", "ADSAD42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADSAD42)) +
  labs(subtitle = "During the past 30 days, how often people felt so sad that nothing could cheer the person up",
       title = "Contingency table: Smoking and feeling sad", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

# looking at how the contingency table for perceived mental health and smoking
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "18-24") %>% as_tibble %>%
  mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "25-34") %>% as_tibble %>%
  mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "35-44") %>% as_tibble %>%
  mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "45-64") %>% as_tibble %>%
  mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "65+") %>% as_tibble %>%
  mutate(age = "65+")

all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)

all_tables %>% mutate(age = factor(age)) %>%
  ggplot(aes(x = n, fill = MNHLTH31, y = ADSMOK42)) +
  geom_col(orientation = "y") +
  facet_grid(rows = vars(age))+
  labs(title = "Perceived Mental Health and smoking in different age groups", y = "",
       subtitle = "Contingency tables with conditional probability", x="frequency", fill="Perceived Mental Health") +
  scale_y_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))+
  theme(strip.text.y = element_text(angle = 0))

# looking at how the contingency table for "ADDPRS42" and smoking
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "18-24") %>% as_tibble %>%
  mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "25-34") %>% as_tibble %>%
  mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "35-44") %>% as_tibble %>%
  mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "45-64") %>% as_tibble %>%
  mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "65+") %>% as_tibble %>%
  mutate(age = "65+")

all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)

all_tables %>% mutate(age = factor(age)) %>%
  ggplot(aes(x = n, fill = ADDPRS42, y = ADSMOK42)) +
  geom_col(orientation = "y") +
  facet_grid(rows = vars(age))+
  labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
       y = "", fill="How many days people felt down",
       title = "Contingency table: Smoking and feeling down", x="frequency", ) +
  scale_y_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))+
  theme(strip.text.y = element_text(angle = 0))

In general, we can say that the mental health of non-smokers is better across all indicators, and age doesn’t seem to have any impact on this. This is expected, because old age is not associated with worse mental health as strongly as it is associated with physical and general health.

3.1.3 General Health and Smoking

#ADGENH42
table_multi_cat(small_df, "ADSMOK42", "ADGENH42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADGENH42)) +
  labs(title = "Contingency table: Smoking and General health", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

#RTHLTH31
table_multi_cat(small_df, "ADSMOK42", "RTHLTH31") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = RTHLTH31)) +
  labs(title = "Contingency table: Smoking and Perceived Health Status", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

#ADENGY42
table_multi_cat(small_df, "ADSMOK42", "ADENGY42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADENGY42)) +
  labs(subtitle = "During the past 4 weeks, how many people had a lot of energy",
       title = "Contingency table: Smoking and energy", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

#WLKLIM53

table_multi_cat(small_df, "ADSMOK42", "WLKLIM53") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = WLKLIM53)) +
  labs(title = "Contingency table: Smoking and Limitation in Physical Functioning", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Smokers",
                              "2 NO" = "Non Smokers"))

We see what we were expecting: the general health of smokers is worse. Non-Smokers are healthier, feel healthier and have more energy. We also see that among smokers there is a much higher rate of limitation in physical functioning. It’s hard to look for a direct causality between the two: there are many interactions between different demographic factors. We can explore this deeper, we will look at this association in different age groups.

# odds ratio: smoking and walking limitations in different age groups

table_multi_cat_control = function(data, explanatory_var, var_of_interest, cond_prob = T){
  tbl = data %>%
    select(explanatory_var, var_of_interest,age_group)%>%
    filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
           !.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
                                            "-1 INAPPLICABLE",
                                            "-8 DK","-7 REFUSED"))%>%
    mutate(across(c(1,2), factor)) %>%
    table()
  if (cond_prob){
    tbl = tbl %>% prop.table(margin = 1)
  }
  tbl
}



table_multi_cat_control(small_df, "ADSMOK42", "WLKLIM53",F) %>% as_tibble() %>%
  filter(age_group != "0-18") %>%
  group_by(age_group) %>%
  summarise(or_estimate = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"estimate"],
            or_lower = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"lower"],
            or_upper = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"upper"]) %>%
  mutate(age_group = factor(age_group)) %>%
  ggplot() +
  geom_point(aes(x = age_group, y = or_estimate)) +
  geom_segment(aes(x = age_group,xend=age_group,
                   y = or_lower, yend=or_upper))+
  geom_hline(aes(yintercept =1), col = "red") +
  labs(title = "Smoking and Limitation in Physical Functioning across different ages",
       subtitle = "95% confidence intervals for odds ratio",
       x = "Age Group",
       y="odds ratio")
## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `or_estimate = ...[]`.
## ℹ In group 1: `age_group = "18-24"`.
## Caused by warning in `chisq.test()`:
## ! L'approssimazione al Chi-quadrato potrebbe essere inesatta
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

This is an interesting graph: young people (under 35) tend to have good general health indicators regardless of their smoking habits. This changes as people start to age past that threshold. The next two age groups show a larger association between smoking and worse health than the oldest group. Is this suggesting that smoking makes the effects of aging start sooner? However once people reach old age, their health gets worse even if they don’t smoke.

3.2 Contingency tables for electronic nicotine product

3.2.1 Diseases Diagnosis

# diseases association with electronic nicotine product
for(variable in health_disease_variables){
  if("1 YES" %in% levels(small_df[[variable]])){
  print(titles_health_disease[variable])
  print("Contingency table")
  print(cont_tables(small_df, "SDENICPROD", variable)%>% round(digits=3))
  print("testing hypothesis of no association")
  print(cont_tables(small_df, "SDENICPROD", variable,F)%>% chisq.test(correct = F))
  print("------------")
  }

}
##             ASTHDX 
## "Asthma Diagnosis" 
## [1] "Contingency table"
##           ASTHDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.201 0.799
##      2 NO  0.144 0.856
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 52.75, df = 1, p-value = 3.788e-13
## 
## [1] "------------"
##                ARTHDX 
## "Arthritis Diagnosis" 
## [1] "Contingency table"
##           ARTHDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.251 0.749
##      2 NO  0.331 0.669
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 61.669, df = 1, p-value = 4.064e-15
## 
## [1] "------------"
##                  CALUNG 
## "Lung Cancer Diagnosis" 
## [1] "Contingency table"
##           CALUNG
## SDENICPROD 1 YES  2 NO
##      1 YES 0.071 0.929
##      2 NO  0.026 0.974
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 12.572, df = 1, p-value = 0.0003916
## 
## [1] "------------"
##                  CHOLDX 
## "Cholesterol Diagnosis" 
## [1] "Contingency table"
##           CHOLDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.281 0.719
##      2 NO  0.389 0.611
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 102.68, df = 1, p-value < 2.2e-16
## 
## [1] "------------"
##                       CHBRON31 
## "Chronic Bronchitis Diagnosis" 
## [1] "Contingency table"
##           CHBRON31
## SDENICPROD 1 YES  2 NO
##      1 YES 0.024 0.976
##      2 NO  0.014 0.986
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 13.059, df = 1, p-value = 0.0003019
## 
## [1] "------------"
##           DIABDX_M18 
## "Diabetes Diagnosis" 
## [1] "Contingency table"
##           DIABDX_M18
## SDENICPROD 1 YES  2 NO
##      1 YES 0.100 0.900
##      2 NO  0.155 0.845
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 50.32, df = 1, p-value = 1.306e-12
## 
## [1] "------------"
##                EMPHDX 
## "Emphysema Diagnosis" 
## [1] "Contingency table"
##           EMPHDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.041 0.959
##      2 NO  0.020 0.980
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 40.451, df = 1, p-value = 2.016e-10
## 
## [1] "------------"
##                          HIBPDX 
## "High Blood Pressure Diagnosis" 
## [1] "Contingency table"
##           HIBPDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.310 0.690
##      2 NO  0.413 0.587
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 91.721, df = 1, p-value < 2.2e-16
## 
## [1] "------------"
##             STRKDX 
## "Stroke Diagnosis" 
## [1] "Contingency table"
##           STRKDX
## SDENICPROD 1 YES  2 NO
##      1 YES 0.039 0.961
##      2 NO  0.054 0.946
## [1] "testing hypothesis of no association"
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 9.7534, df = 1, p-value = 0.00179
## 
## [1] "------------"

Association between usage of Electronic Nicotine product and diseases: analysis with conditional contingency tables and chi squared tests Summary of key points:

  • Asthma: strong association: who uses e cigs is more likely to have asthma

-Arthritis: strong association but in the opposite direction: who uses e-cigs has a lower rate of Arthritis. Probably because of confounding effect of age. Young people are less likely to have Arthritis, but are much more likely to use e cigs

-Lung Cancer: strong association, correct direction

-Cholesterol: strong association, but in opposite direction: need to control for age like arthritis

-Chronic Bronchitis: strong association, correct direction

-Diabetes: strong association, correct direction

-Emphysema: strong association, correct direction

-High Blood Pressure: strong association, but in opposite direction: need to control for age like arthritis

-Stroke: strong association, correct direction

chisq_df_diseases = data.frame(diseases = health_disease_variables,
                               p_val_full_data= rep(NA, length(health_disease_variables)),
                               p_val_65plus_data= rep(NA, length(health_disease_variables)),
                               p_val_45_64_data= rep(NA, length(health_disease_variables)))

for(i in 1:nrow(chisq_df_diseases)){
  variable = chisq_df_diseases[i,"diseases"]
  test_full = cont_tables(small_df,"SDENICPROD",variable, F) %>%
    chisq.test(correct = F)
  test_45_64 = cont_tables_age(small_df,"SDENICPROD",variable,"45-64",F) %>%
    chisq.test(correct = F)
  test_65plus = cont_tables_age(small_df,"SDENICPROD",variable,"65+",F) %>%
    chisq.test(correct = F)
  chisq_df_diseases[i,"p_val_full_data"] = test_full$p.val
  chisq_df_diseases[i,"p_val_45_64_data"] = test_45_64$p.val
  chisq_df_diseases[i,"p_val_65plus_data"] = test_65plus$p.val
  
}
## Warning in chisq.test(., correct = F): L'approssimazione al Chi-quadrato
## potrebbe essere inesatta

## Warning in chisq.test(., correct = F): L'approssimazione al Chi-quadrato
## potrebbe essere inesatta
chisq_df_diseases = chisq_df_diseases %>%
  rename(full_data = p_val_full_data,
         age_65plus = p_val_65plus_data,
         age_45_64 = p_val_45_64_data) %>%
  mutate(diseases = titles_health_disease[diseases]) %>%
  pivot_longer(cols=c("full_data","age_65plus","age_45_64"))

ggplot(chisq_df_diseases, aes(x = name, y = value)) +
  geom_col() + geom_hline(aes(yintercept = 0.05), col = "red") +
  facet_wrap(~diseases, scales = "free") + 
  labs(title="Testing hypothesis of no association between 
Electronic Nicotine Product usage and various diseases",
       subtitle = "P-value of Chi squared test, controlling for age groups",
       x = "", y ="")+
  scale_x_discrete(labels = c("full_data" = "All patients",
                              "age_65plus" = "ages 65 plus" ,
                              "age_45_64" = "ages 45-64")) +
  theme_minimal() + theme(axis.text.x = element_text(size = 7))

#odds ratio
#odds of diseases for users/ odds of diseases for non users

or_data_all = expand.grid(diseases = health_disease_variables,
                      age = "all",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_all)){
  var_of_interest = as.character(or_data_all$diseases[i])
  or = small_df %>%
    select(SDENICPROD,var_of_interest)%>%
    filter(SDENICPROD %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_all[i,"estimate"] = or$measure[2,"estimate"]
  or_data_all[i,"lower"] = or$measure[2,"lower"]
  or_data_all[i,"upper"] = or$measure[2,"upper"]
}

or_data_45_64 = expand.grid(diseases = health_disease_variables,
                      age = "45_64",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_45_64)){
  var_of_interest = as.character(or_data_45_64$diseases[i])
  or = small_df %>% filter(age_group == "45-64") %>%
    select(SDENICPROD,var_of_interest)%>%
    filter(SDENICPROD %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_45_64[i,"estimate"] = or$measure[2,"estimate"]
  or_data_45_64[i,"lower"] = or$measure[2,"lower"]
  or_data_45_64[i,"upper"] = or$measure[2,"upper"]
}
## Warning in chisq.test(xx, correct = correction): L'approssimazione al
## Chi-quadrato potrebbe essere inesatta
or_data_65plus = expand.grid(diseases = health_disease_variables,
                      age = "65plus",estimate = NA,
                      lower = NA, upper = NA)

for(i in 1:nrow(or_data_65plus)){
  var_of_interest = as.character(or_data_65plus$diseases[i])
  or = small_df %>% filter(age_group == "65+") %>%
    select(SDENICPROD,var_of_interest)%>%
    filter(SDENICPROD %in% c("1 YES", "2 NO"),
           .data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
    mutate(across(c(1,2), factor)) %>%
    table() %>% epitools::oddsratio(method = "wald", correction = F)
  or_data_65plus[i,"estimate"] = or$measure[2,"estimate"]
  or_data_65plus[i,"lower"] = or$measure[2,"lower"]
  or_data_65plus[i,"upper"] = or$measure[2,"upper"]
}
## Warning in chisq.test(xx, correct = correction): L'approssimazione al
## Chi-quadrato potrebbe essere inesatta
or_data = rbind(or_data_all,or_data_45_64,or_data_65plus)

or_data %>% mutate(diseases = titles_health_disease[diseases]) %>%
  ggplot(aes(x = age)) + geom_point(aes(y = estimate)) +
  geom_segment(aes(x = age, xend = age, y = lower, yend = upper))+
  geom_hline(aes(yintercept  = 1), col = "red")+
  facet_wrap(~diseases, scales = "free") +
  labs(title="Odds Ratio of Disease Diagnosis for Users and Non-Users of E-Nic. Prods",
       subtitle = "95% confidence intervals for different age groups",
       x = "", y ="")+
  scale_x_discrete(labels = c("all" = "All",
                              "65plus" = "65 and over" ,
                              "45_64" = "45-64")) +
  theme_minimal() + theme(axis.text.x = element_text(size = 7))

The graph is clearly showing what we expected: if we don’t control for age, the association between diagnosis and usage of E-nic. products is opposite. However, when we look at individual age groups the direction changes. We found many cases of Simpson’s paradox: Arthritis, Blood pressure, cholesterol, and Stroke.

3.2.2 General Health

#ADGENH42
table_multi_cat(small_df, "SDENICPROD", "ADGENH42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADGENH42)) +
  labs(title = "Contingency table: Electronic nicotine product usage and General health", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Users",
                              "2 NO" = "Non users"))

#RTHLTH31
table_multi_cat(small_df, "SDENICPROD", "RTHLTH31") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = RTHLTH31)) +
  labs(title = "Contingency table: Electronic nicotine product usage and Perceived Health Status", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "Users",
                              "2 NO" = "Non Users"))

#ADENGY42
table_multi_cat(small_df, "SDENICPROD", "ADENGY42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADENGY42)) +
  labs(subtitle = "During the past 4 weeks, how many people had a lot of energy",
       title = "Contingency table: Electronic nicotine product usage and energy", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

#WLKLIM53

table_multi_cat(small_df, "SDENICPROD", "WLKLIM53") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = WLKLIM53)) +
  labs(title = "Electronic nicotine product usage and Limitation in Physical Functioning",
       subtitle = "Contingency table with conditional probability",
       x = "",
       y="frequency", fill="Walking Limitation") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))+
  scale_fill_discrete(labels = c("Yes", "No"))

# odds ratio: E-nic usage and walking limitations in different age groups

table_multi_cat_control(small_df, "SDENICPROD", "WLKLIM53",F) %>% as_tibble() %>%
  filter(age_group != "0-18") %>%
  group_by(age_group) %>%
  summarise(or_estimate = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"estimate"],
            or_lower = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"lower"],
            or_upper = epitools::oddsratio(n,method = "wald",
                                              correction = F)$measure[2,"upper"]) %>%
  mutate(age_group = factor(age_group)) %>%
  ggplot() +
  geom_point(aes(x = age_group, y = or_estimate)) +
  geom_segment(aes(x = age_group,xend=age_group,
                   y = or_lower, yend=or_upper))+
  geom_hline(aes(yintercept =1), col = "red") +
  labs(title = "E-nicotine usage and Limitation in Physical Functioning across different ages",
       subtitle = "95% confidence intervals for odds ratio",
       x = "Age Group",
       y="odds ratio")
## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `or_estimate = ...[]`.
## ℹ In group 1: `age_group = "18-24"`.
## Caused by warning in `chisq.test()`:
## ! L'approssimazione al Chi-quadrato potrebbe essere inesatta
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

3.2.3 Mental health

# "MNHLTH31"
table_multi_cat(small_df, "SDENICPROD", "MNHLTH31") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = MNHLTH31)) +
  labs(title = "electronic nicotine product usage and Perceived Mental Health status", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

#"ADPCFL42"
table_multi_cat(small_df, "SDENICPROD", "ADPCFL42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADPCFL42)) +
  labs(subtitle = "During the past 4 weeks, how often people felt calm and peaceful",
       title = "electronic nicotine product usage and feeling peaceful", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

#"ADDPRS42"
table_multi_cat(small_df, "SDENICPROD", "ADDPRS42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADDPRS42)) +
  labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
       title = "Contingency table: electronic nicotine product usage and feeling down", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

# "ADHOPE42"
table_multi_cat(small_df, "SDENICPROD", "ADHOPE42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADHOPE42)) +
  labs(subtitle = "During the past 30 days, how often people felt hopeless",
       title = "electronic nicotine product usage and feeling hopeless", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

# "ADNERV42"
table_multi_cat(small_df, "SDENICPROD", "ADNERV42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADNERV42)) +
  labs(subtitle = "During the past 30 days, how often people felt nervous",
       title = "electronic nicotine product usage and feeling nervous", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

# "ADSAD42" 
table_multi_cat(small_df, "SDENICPROD", "ADSAD42") %>% as_tibble() %>%
  ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADSAD42)) +
  labs(subtitle = "During the past 30 days, how often people felt so sad that nothing could cheer the person up",
       title = "electronic nicotine product usage and feeling sad", x = "",
       y="frequency", fill="") +
  scale_x_discrete(labels = c("1 YES" = "users",
                              "2 NO" = "Non users"))

# mental health controlling for age

# looking at how the contingency table for perceived mental health and e-nic. usage
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "18-24") %>% as_tibble %>%
  mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "25-34") %>% as_tibble %>%
  mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "35-44") %>% as_tibble %>%
  mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "45-64") %>% as_tibble %>%
  mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "65+") %>% as_tibble %>%
  mutate(age = "65+")

all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)

all_tables %>% mutate(age = factor(age)) %>%
  ggplot(aes(x = n, fill = MNHLTH31, y = SDENICPROD)) +
  geom_col(orientation = "y") +
  facet_grid(rows = vars(age))+
  labs(title = "Perceived Mental Health and E-Nic. in different age groups", y = "",
       subtitle = "Contingency tables with conditional probability", x="frequency", fill="Perceived Mental Health") +
  scale_y_discrete(labels = c("1 YES" = "Users",
                              "2 NO" = "Non-Users"))+
  theme(strip.text.y = element_text(angle = 0))

Summary:

the general health of e-nicotine products users is the not much lower than that of non users. However there is a bigger gap in the mental health of the two groups: in the data we see that non users tend to have higher values of good mental health

4 Smoking Cigarettes

Now we started to fit some models for Smoking and E-Smoking. We decided to analyse the two variables in distinct ways because… We decide to start with the smoking predictor. We want to keep only AGE, SEX and AGEGROUP as other predictors.

4.1 Logistic regression for physical health

values_to_remove <- c("-15 CANNOT BE COMPUTED", "-8 DK", "-7 REFUSED", "-1 INAPPLICABLE")
small_df = small_df %>% mutate(AGEGROUP = age_group)

4.1.1 Arthritis

filter_df <- small_df %>% select(ARTHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$ARTHDX <- relevel(filter_df$ARTHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(ARTHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = ARTHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -5.275674   0.094329 -55.928  < 2e-16 ***
## ADSMOK421 YES              0.182979   0.076769   2.384  0.01715 *  
## SEX2 FEMALE                0.580436   0.044777  12.963  < 2e-16 ***
## AGELAST                    0.072394   0.001385  52.257  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.340559   0.101467   3.356  0.00079 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19607  on 15389  degrees of freedom
## Residual deviance: 15400  on 15385  degrees of freedom
## AIC: 15410
## 
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(ARTHDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithCancer/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of ARTHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

All our variables are significant. Arthritis is characterized by a positive correlation with smoking (more pronounced in women), sex and age.

4.1.2 Diabetes

filter_df <- small_df %>% select(DIABDX_M18, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$DIABDX_M18 <- relevel(filter_df$DIABDX_M18, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(DIABDX_M18 ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = DIABDX_M18 ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -4.201238   0.101001 -41.596  < 2e-16 ***
## ADSMOK421 YES              0.029254   0.087979   0.333 0.739500    
## SEX2 FEMALE               -0.152259   0.051660  -2.947 0.003205 ** 
## AGELAST                    0.043263   0.001486  29.117  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.408440   0.117066   3.489 0.000485 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13262  on 15417  degrees of freedom
## Residual deviance: 12230  on 15413  degrees of freedom
## AIC: 12240
## 
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(DIABDX_M18 == "1 YES"), tot = n()) %>%
  mutate(frequency = WithCancer/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of DIABDX_M18 by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  ylim(c(0, 0.6))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

In this case, looking at smoking in general is not significant, but if we control for sex, in the female there is definitely a correlation. Diabetes is characterized by a positive correlation with smoking (significant in women) and with age, and a negative one with being female.

4.1.3 Emphysema

filter_df <- small_df %>% select(EMPHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$EMPHDX <- relevel(filter_df$EMPHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(EMPHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = EMPHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -7.777000   0.307674 -25.277  < 2e-16 ***
## ADSMOK421 YES              1.104037   0.169023   6.532  6.5e-11 ***
## SEX2 FEMALE               -0.366242   0.141096  -2.596  0.00944 ** 
## AGELAST                    0.061537   0.004193  14.675  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.644011   0.224886   2.864  0.00419 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3400.4  on 15388  degrees of freedom
## Residual deviance: 2994.3  on 15384  degrees of freedom
## AIC: 3004.3
## 
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(EMPHDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithCancer/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of EMPHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  ylim(c(0, 0.35))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

All our variables are significant. Emphysema is characterized by a positive correlation with smoking and age (more pronounced in smokers), and a negative one for being female.

4.1.4 Stroke

filter_df <- small_df %>% select(STRKDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$STRKDX <- relevel(filter_df$STRKDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(STRKDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = STRKDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -6.584584   0.195221 -33.729  < 2e-16 ***
## ADSMOK421 YES              0.540824   0.125410   4.312 1.61e-05 ***
## SEX2 FEMALE               -0.196109   0.083610  -2.346    0.019 *  
## AGELAST                    0.060109   0.002713  22.155  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.240915   0.168872   1.427    0.154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6558.6  on 15388  degrees of freedom
## Residual deviance: 5871.2  on 15384  degrees of freedom
## AIC: 5881.2
## 
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(STRKDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithCancer/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of STRKDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  ylim(c(0, 0.4))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

All our variables are significant, except for the interaction between smoking and sex. Stroke is characterized by a positive correlation with smoking and age, and a negative one for being female.

4.1.5 Lung Cancer

filter_df <- small_df %>% select(CALUNG, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CALUNG <- relevel(filter_df$CALUNG, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(CALUNG ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CALUNG ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -6.06890    0.96416  -6.295 3.08e-10 ***
## ADSMOK421 YES              1.27743    0.44114   2.896  0.00378 ** 
## SEX2 FEMALE                0.24805    0.30244   0.820  0.41212    
## AGELAST                    0.03153    0.01263   2.497  0.01252 *  
## ADSMOK421 YES:SEX2 FEMALE -0.96393    0.63175  -1.526  0.12706    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 566.73  on 2233  degrees of freedom
## Residual deviance: 553.81  on 2229  degrees of freedom
## AIC: 563.81
## 
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(CALUNG == "1 YES"), tot = n()) %>%
  mutate(frequency = WithCancer/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CALUNG by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  ylim(c(0, 0.35))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

In this case, both age and interaction between smoking and sex seem to be not significant. However we can observe that Lung Cancer is characterized by a positive correlation with both smoking and age, and this correlations are more pronounced for men.

4.1.6 Asthma

filter_df <- small_df %>% select(ASTHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))

filter_df$ASTHDX <- relevel(filter_df$ASTHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(ASTHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = ASTHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.671965   0.076252 -21.927  < 2e-16 ***
## ADSMOK421 YES              0.041146   0.089118   0.462  0.64430    
## SEX2 FEMALE                0.285653   0.051757   5.519 3.41e-08 ***
## AGELAST                   -0.004836   0.001230  -3.933 8.38e-05 ***
## ADSMOK421 YES:SEX2 FEMALE  0.313916   0.114266   2.747  0.00601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13120  on 15420  degrees of freedom
## Residual deviance: 13025  on 15416  degrees of freedom
## AIC: 13035
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithAsthma = sum(ASTHDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithAsthma/tot) 
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of ASTHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

Also in this case, looking at smoking in general is not significant, but if we look at the interaction between smoking and sex, there is definitely a correlation. Asthma is quite particular, because is the only one that is characterized by a negative correlation with age. Also in this case we have a positive correlation with smoking (significant in female) and with sex.

4.1.7 High Cholesterol

filter_df <- small_df %>% select(CHOLDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CHOLDX <- relevel(filter_df$CHOLDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CHOLDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CHOLDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.744208   0.075692 -49.466  < 2e-16 ***
## ADSMOK421 YES              0.060372   0.068518   0.881 0.378254    
## SEX2 FEMALE               -0.174565   0.041336  -4.223 2.41e-05 ***
## AGELAST                    0.059723   0.001189  50.227  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.317587   0.093992   3.379 0.000728 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20600  on 15380  degrees of freedom
## Residual deviance: 17303  on 15376  degrees of freedom
## AIC: 17313
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithChol = sum(CHOLDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithChol/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


prediction$Yval <- predict(fit, newdata = prediction, type = "response")

ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CHOLDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

Also in this case, looking at smoking in general is not significant, as we see in the interaction with sex, in the female there is definitely a correlation. High Cholesterol is characterized by a positive correlation with smoking (significant in women) and with age, and a negative one with being female.

4.1.8 Chronic Bronchitis

filter_df <- small_df %>% select(CHBRON31, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)


# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CHBRON31 <- relevel(filter_df$CHBRON31, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CHBRON31 ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CHBRON31 ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -5.540038   0.261348 -21.198  < 2e-16 ***
## ADSMOK421 YES              0.383047   0.261141   1.467   0.1424    
## SEX2 FEMALE                0.399067   0.161999   2.463   0.0138 *  
## AGELAST                    0.017757   0.003786   4.690 2.73e-06 ***
## ADSMOK421 YES:SEX2 FEMALE  0.576253   0.308866   1.866   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2537.5  on 15328  degrees of freedom
## Residual deviance: 2467.1  on 15324  degrees of freedom
## AIC: 2477.1
## 
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>%  group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithBronch = sum(CHBRON31 == "1 YES"), tot = n()) %>%
  mutate(frequency = WithBronch/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CHBRON31 by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

In this case, both age and interaction between smoking and sex seem to be not so significant (not at all in general and really low for the interaction). However we can observe that Chronic Bronchitis is characterized by a positive correlation with both age and smoking (more pronunced a little bit more pronunced for female).

4.1.9 High Blood Pressure

filter_df <- small_df %>% select(HIBPDX, ADSMOK42, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$HIBPDX <- relevel(filter_df$HIBPDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(HIBPDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = HIBPDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.972786   0.077436 -51.304  < 2e-16 ***
## ADSMOK421 YES              0.250575   0.068732   3.646 0.000267 ***
## SEX2 FEMALE               -0.230722   0.041966  -5.498 3.84e-08 ***
## AGELAST                    0.065871   0.001229  53.612  < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE  0.189444   0.094694   2.001 0.045437 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20879  on 15382  degrees of freedom
## Residual deviance: 16958  on 15378  degrees of freedom
## AIC: 16968
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithPress = sum(HIBPDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithPress/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = ADSMOK42)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of HIBPDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

In this case, all our variables are significant. High Blood Pressure is characterized by a positive correlation with smoking and with age, and a negative one with being female.

As we can say we obtained a lot of different results, both negative and positive, but we can see that for some models there is a direct correlation with AGE. Also some plots are quite similar (like the one for Cholesterol and Blood pressure) So there could be some connection between those disease. To prove this we can try if some physical diseases have some of connection.

4.1.9.1 Trying to see if Cholesterol and Blood pressure are correlated

filter_df <- small_df %>% select(CHOLDX, HIBPDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$HIBPDX <- relevel(filter_df$HIBPDX, ref = "2 NO")
filter_df$CHOLDX <- relevel(filter_df$CHOLDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CHOLDX ~ HIBPDX * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CHOLDX ~ HIBPDX * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.581089   0.079307 -45.155   <2e-16 ***
## HIBPDX1 YES              1.370965   0.058465  23.449   <2e-16 ***
## SEX2 FEMALE              0.004253   0.054612   0.078   0.9379    
## AGELAST                  0.045294   0.001271  35.642   <2e-16 ***
## HIBPDX1 YES:SEX2 FEMALE -0.152694   0.077325  -1.975   0.0483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20594  on 15375  degrees of freedom
## Residual deviance: 16282  on 15371  degrees of freedom
## AIC: 16292
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGELAST, SEX, HIBPDX) %>% summarise(WithChol = sum(CHOLDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithChol/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), HIBPDX = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


prediction$Yval <- predict(fit, newdata = prediction, type = "response")

ggplot(prediction, aes(x = AGELAST, y = Yval, col = HIBPDX)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of High Cholesterol by Age, High Blood Pressure, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

As we can see, there is definitely a correlation.

4.2 Multilogit regression for Mental Health

Now to evaluate the mental health status we need to change the type of the regression. In the first part we used the Logistic Regression, but for this part we need to use the Multinomial Logistic Regression. MLR is a classification method that modify logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes (I promise, is just a case that is very similar to the Wikipedia description).

4.2.1 Perceived Mental Health Status

filter_df <-small_df %>% select(MNHLTH31, ADSMOK42, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$MNHLTH31 <- relevel(filter_df$MNHLTH31, ref = "1 EXCELLENT")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
filter_df$MNHLTH31 = ordered(filter_df$MNHLTH31)
fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)

df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
        ylab = "Predicted Probability", xlab = "Age", lwd = 2) +
  title("Mental Health by Age") + 
  grid()
## integer(0)
legend("left",legend = sort(unique(filter_df$MNHLTH31)), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6)

Here we can see the general distribution of the Perceived Mental Health Status for every age. As we see, the majority of the people are distributed between the 3 best cases, with the “very good” one being the most frequent.

null <-  vglm(MNHLTH31 ~ 1,
             family = cumulative(parallel = TRUE),
             data = filter_df)

pchisq(deviance(null)-deviance(fit), df <- df.residual(null)-df.residual(fit), lower.tail=FALSE)
## [1] 0.194099

Then we fitted the model with also SEX

fit <- vglm(MNHLTH31 ~ ADSMOK42 + SEX + AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)
summary(fit)
## 
## Call:
## vglm(formula = MNHLTH31 ~ ADSMOK42 + SEX + AGELAST, family = cumulative(parallel = TRUE), 
##     data = filter_df)
## 
## Coefficients: 
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -0.6862955  0.0494676 -13.874   <2e-16 ***
## (Intercept):2  0.7047961  0.0494543  14.251   <2e-16 ***
## (Intercept):3  2.4488090  0.0539677  45.375   <2e-16 ***
## (Intercept):4  4.2516580  0.0747581  56.872   <2e-16 ***
## ADSMOK421 YES -0.6026349  0.0379176 -15.893   <2e-16 ***
## SEX2 FEMALE   -0.2528145  0.0293955  -8.600   <2e-16 ***
## AGELAST       -0.0013227  0.0007976  -1.658   0.0972 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 41991.95 on 61629 degrees of freedom
## 
## Log-likelihood: -20995.97 on 61629 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
## 
## 
## Exponentiated coefficients:
## ADSMOK421 YES   SEX2 FEMALE       AGELAST 
##     0.5473675     0.7766120     0.9986782

Oh no! There is a problem in our regression, what is that? The Hauck-Donner effect? It reminds me a name of a music instrument brand…never mind.

The Hauck-Donner effect is a problem that can occur if we use small data-frame, indicates that the significance of our model can be false. But how can we solve it? Don’t worry, is very simple.

We bootstrapped the original data-frame for 100 times (better more but our pc are not so powerful).

data <- filter_df
formula <- MNHLTH31 ~ ADSMOK42 + SEX + AGELAST

fit_model <- function(data, indices) {
  boot_data <- data[indices, ]
  model <- vglm(formula,
             family = cumulative(parallel = TRUE),
             data = boot_data)
  return(coef(model))
}

set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)

bootstrap_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = fit_model, R = 100)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* -0.686295534  8.293538e-04 0.0481159045
## t2*  0.704796145  2.315522e-03 0.0512439628
## t3*  2.448809015  4.123762e-03 0.0553600060
## t4*  4.251657965  4.672130e-03 0.0857172943
## t5* -0.602634909 -2.362184e-03 0.0409388089
## t6* -0.252814473 -5.483153e-04 0.0285605837
## t7* -0.001322697 -1.281613e-06 0.0008524966
boot.ci(bootstrap_results, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "basic")
## 
## Intervals : 
## Level      Basic         
## 95%   (-0.7822, -0.5829 )  
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable

Fortunately the bootstrap confirm the robustness of our model.

4.2.1.1 Difference between smokers and non smokers

Then we compared the smokers and non smokers filtering the dataset for this variable and fitting 2 models.

smk_df = filter_df %>% filter(ADSMOK42 == "1 YES")
n_smk_df = filter_df %>% filter(ADSMOK42 == "2 NO")
# For smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.08))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST") 
grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.69))
mtext("Perceived Mental Health Status for Smokers", outer = TRUE, cex = 1.5, line = -3.5)

# For non smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = n_smk_df)

df_pred = data.frame(AGELAST = seq(min(n_smk_df$AGELAST), max(n_smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) # check that they sum to 1
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.14))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.37))
mtext("Perceived Mental Health Status for Non-Smokers", outer = TRUE, cex = 1.5, line = -3)

From this 2 groups of plots we see that non-smokers tend to have better Perceived Mental Health. The majority of the non-smoking population perceives their mental health as “very good,” with the best category (“excellent,” shown in black) being the second most prevalent up to age 56. In contrast, among smokers, the most common perception is “good,” followed by “very good.” “Excellent” ranks only third.

4.3 General Health Smoking

Here, we have performed the same analysis for the “General Health” variable.

filter_df <-small_df %>% select(ADGENH42, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$ADGENH42 <- relevel(filter_df$ADGENH42, ref = "1 EXCELLENT")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- vglm(ADGENH42 ~ AGELAST , data = filter_df, family = multinomial)

# summary(fit)

df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
        ylab = "Predicted Probability", xlab = "Age", lwd = 2) 
title("General Health by Age")  
grid()
legend("bottomleft",legend = sort(unique(filter_df$ADGENH42)), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6, inset = c(0, 0.25))

The General Health plot is quite different from the previous one, as we see that the effect of age is stronger. In this case we see a strong decreasing in the best cases (Very good and Excellent) and the increase of the others.

confint(fit)
##                     2.5 %        97.5 %
## (Intercept):1  4.26357682  5.1100094402
## (Intercept):2  4.07383515  4.9055471321
## (Intercept):3  3.32900825  4.1641201093
## (Intercept):4  1.64751697  2.5242616426
## AGELAST:1     -0.05768480 -0.0441035318
## AGELAST:2     -0.03782884 -0.0247116562
## AGELAST:3     -0.02556620 -0.0124293833
## AGELAST:4     -0.01309841  0.0006783151
null = vglm(ADGENH42 ~ 1, family = multinomial, data  = filter_df)

print(pseudo_R2 <- 1 - deviance(fit) / deviance(null))
## [1] 0.02186083
anova(fit, test="LRT")
## Analysis of Deviance Table (Type II tests)
## 
## Model: 'multinomial', 'VGAMcategorical'
## 
## Link: 'multilogitlink'
## 
## Response: ADGENH42
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## AGELAST  4    904.6     59756      41380 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Than we added the sex variable also in this case and we did some plots

filter_df$ADGENH42 = ordered(filter_df$ADGENH42)
fit <- vglm(ADGENH42 ~ ADSMOK42 + SEX + AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)
summary(fit)
## 
## Call:
## vglm(formula = ADGENH42 ~ ADSMOK42 + SEX + AGELAST, family = cumulative(parallel = TRUE), 
##     data = filter_df)
## 
## Coefficients: 
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -0.1321573  0.0504790  -2.618  0.00884 ** 
## (Intercept):2  1.6604184  0.0519717  31.949  < 2e-16 ***
## (Intercept):3  3.3748393  0.0573363  58.860  < 2e-16 ***
## (Intercept):4  5.4332679  0.0762192  71.285  < 2e-16 ***
## ADSMOK421 YES -0.7088684  0.0389372 -18.205  < 2e-16 ***
## SEX2 FEMALE   -0.1800449  0.0300423  -5.993 2.06e-09 ***
## AGELAST       -0.0252239  0.0008396 -30.043  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 40151.52 on 59753 degrees of freedom
## 
## Log-likelihood: -20075.76 on 59753 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
## 
## 
## Exponentiated coefficients:
## ADSMOK421 YES   SEX2 FEMALE       AGELAST 
##     0.4922008     0.8352327     0.9750915
data <- filter_df
formula <- ADGENH42 ~ ADSMOK42 + SEX + AGELAST

fit_model <- function(data, indices) {
  boot_data <- data[indices, ]
  model <- vglm(formula,
             family = cumulative(parallel = TRUE),
             data = boot_data)
  return(coef(model))
}

set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)

bootstrap_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = fit_model, R = 100)
## 
## 
## Bootstrap Statistics :
##        original        bias     std. error
## t1* -0.13215726 -4.730256e-03 0.0561873669
## t2*  1.66041835 -2.431014e-04 0.0568043897
## t3*  3.37483932  1.344731e-03 0.0617590369
## t4*  5.43326790 -1.390854e-03 0.0801421083
## t5* -0.70886842  5.612468e-03 0.0420773001
## t6* -0.18004489  2.424876e-04 0.0292357120
## t7* -0.02522394 -3.782614e-05 0.0009120621
boot.ci(bootstrap_results, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "basic")
## 
## Intervals : 
## Level      Basic         
## 95%   (-0.2509, -0.0156 )  
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable

Also in this case we checked if our model was robust before splitting in smokers and non-smokers, and it was.

smk_df = filter_df %>% filter(ADSMOK42 == "1 YES")
n_smk_df = filter_df %>% filter(ADSMOK42 == "2 NO")
# For smokers
fit <- vglm(ADGENH42 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
  grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.4))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST")
  grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.08))
mtext("General Health Status for Smokers", outer = TRUE, cex = 1.5, line = -3.5)

# For not smokers
fit <- vglm(ADGENH42 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = n_smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
  grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.4))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST")
  grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.08))
mtext("General Health Status for NOT Smokers", outer = TRUE, cex = 1.5, line = -3.5)

Also in this case we see that non-smokers tend to have better General Health. Non smokers are starting from higher General Health categories, and are inverting with worst ones later in the years respect to smokers. We can observe that as years pass, the most of the smoker are distributing between the “Good” and “Fair” categories, with also the “Poor” one surpassing the “Excellent” one. For non smokers this inversion never happens, and later in the years, the most of the people are still distributing between the “Very Good” and “Good” categories.

5 E-Smoking

Form now on we will repeat the same steps for the E-smoking

5.1 Logistic regression for physical health

values_to_remove <- c("-15 CANNOT BE COMPUTED", "-8 DK", "-7 REFUSED", "-1 INAPPLICABLE")

5.1.1 Stroke

filter_df <- small_df %>% select(STRKDX , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$STRKDX  <- relevel(filter_df$STRKDX , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- glm(STRKDX  ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = STRKDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.500144   0.177679 -36.584   <2e-16 ***
## SDENICPROD1 YES              0.223148   0.169500   1.317   0.1880    
## SEX2 FEMALE                 -0.179755   0.071998  -2.497   0.0125 *  
## AGELAST                      0.060259   0.002499  24.118   <2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.349832   0.226808   1.542   0.1230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7480.7  on 18179  degrees of freedom
## Residual deviance: 6708.6  on 18175  degrees of freedom
## AIC: 6718.6
## 
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithStroke = sum(STRKDX  == "1 YES"), tot = n()) %>%
  mutate(frequency = WithStroke/tot)
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of STRKDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.2 Asthma

filter_df <- small_df %>% select(ASTHDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))

filter_df$ASTHDX <- relevel(filter_df$ASTHDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(ASTHDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = ASTHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.795164   0.072533 -24.750  < 2e-16 ***
## SDENICPROD1 YES              0.309304   0.086392   3.580 0.000343 ***
## SEX2 FEMALE                  0.317237   0.046701   6.793  1.1e-11 ***
## AGELAST                     -0.003152   0.001169  -2.697 0.006999 ** 
## SDENICPROD1 YES:SEX2 FEMALE  0.138970   0.112743   1.233 0.217715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15498  on 18180  degrees of freedom
## Residual deviance: 15375  on 18176  degrees of freedom
## AIC: 15385
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithAsthma = sum(ASTHDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithAsthma/tot) 
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of ASTHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.3 Arthritis

filter_df <- small_df %>% select(ARTHDX , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))

filter_df$ARTHDX  <- relevel(filter_df$ARTHDX , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(ARTHDX  ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = ARTHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.37596    0.08785 -61.194  < 2e-16 ***
## SDENICPROD1 YES              0.44257    0.08876   4.986 6.16e-07 ***
## SEX2 FEMALE                  0.61273    0.03999  15.323  < 2e-16 ***
## AGELAST                      0.07392    0.00130  56.841  < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.16719    0.11586   1.443    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22791  on 18178  degrees of freedom
## Residual deviance: 17880  on 18174  degrees of freedom
## AIC: 17890
## 
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>%  group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithAsthma = sum(ARTHDX  == "1 YES"), tot = n()) %>%
  mutate(frequency = WithAsthma/tot) 
## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))

# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")


# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of ARTHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.4 High Cholesterol

filter_df <- small_df %>% select(CHOLDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CHOLDX <- relevel(filter_df$CHOLDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CHOLDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CHOLDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -3.80149    0.07078 -53.707  < 2e-16 ***
## SDENICPROD1 YES              0.15384    0.07640   2.014   0.0441 *  
## SEX2 FEMALE                 -0.15438    0.03677  -4.199 2.69e-05 ***
## AGELAST                      0.06046    0.00112  54.002  < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.12071    0.10611   1.138   0.2553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24025  on 18166  degrees of freedom
## Residual deviance: 20158  on 18162  degrees of freedom
## AIC: 20168
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithChol = sum(CHOLDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithChol/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


prediction$Yval <- predict(fit, newdata = prediction, type = "response")

ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CHOLDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.5 Diabetes

filter_df <- small_df %>% select(DIABDX_M18 , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)


# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$DIABDX_M18 <- relevel(filter_df$DIABDX_M18 , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(DIABDX_M18 ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = DIABDX_M18 ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -4.252017   0.094413 -45.036   <2e-16 ***
## SDENICPROD1 YES             -0.002929   0.105803  -0.028    0.978    
## SEX2 FEMALE                 -0.084187   0.045799  -1.838    0.066 .  
## AGELAST                      0.044011   0.001399  31.462   <2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.082964   0.146543   0.566    0.571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15230  on 18178  degrees of freedom
## Residual deviance: 14010  on 18174  degrees of freedom
## AIC: 14020
## 
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithBronch = sum(DIABDX_M18 == "1 YES"), tot = n()) %>%
  mutate(frequency = WithBronch/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of DIABDX_M18 by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.6 High Blood Pressure

filter_df <- small_df %>% select(HIBPDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$HIBPDX <- relevel(filter_df$HIBPDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(HIBPDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = HIBPDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -4.023559   0.072194 -55.733  < 2e-16 ***
## SDENICPROD1 YES              0.410370   0.075027   5.470 4.51e-08 ***
## SEX2 FEMALE                 -0.177501   0.037231  -4.768 1.86e-06 ***
## AGELAST                      0.066655   0.001154  57.764  < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE -0.178396   0.105726  -1.687   0.0915 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24447  on 18171  degrees of freedom
## Residual deviance: 19861  on 18167  degrees of freedom
## AIC: 19871
## 
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(HIBPDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithPress/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of HIBPDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.7 Lung cancer

filter_df <- small_df %>% select(CALUNG, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CALUNG <- relevel(filter_df$CALUNG, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CALUNG ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CALUNG ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.12639    0.85751  -7.144 9.04e-13 ***
## SDENICPROD1 YES              1.63074    0.45677   3.570 0.000357 ***
## SEX2 FEMALE                  0.07510    0.26186   0.287 0.774257    
## AGELAST                      0.03470    0.01124   3.087 0.002019 ** 
## SDENICPROD1 YES:SEX2 FEMALE -0.41933    0.61786  -0.679 0.497337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 682.92  on 2536  degrees of freedom
## Residual deviance: 661.99  on 2532  degrees of freedom
## AIC: 671.99
## 
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(CALUNG == "1 YES"), tot = n()) %>%
  mutate(frequency = WithPress/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CALUNG by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.8 Bronchitis

filter_df <- small_df %>% select(CHBRON31, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$CHBRON31 <- relevel(filter_df$CHBRON31, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(CHBRON31 ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = CHBRON31 ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.965273   0.252848 -23.592  < 2e-16 ***
## SDENICPROD1 YES              0.774959   0.259697   2.984 0.002844 ** 
## SEX2 FEMALE                  0.491805   0.144139   3.412 0.000645 ***
## AGELAST                      0.024422   0.003629   6.729 1.71e-11 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.167296   0.314788   0.531 0.595102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2908.0  on 18138  degrees of freedom
## Residual deviance: 2829.6  on 18134  degrees of freedom
## AIC: 2839.6
## 
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(CHBRON31 == "1 YES"), tot = n()) %>%
  mutate(frequency = WithPress/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of CHBRON31 by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

5.1.9 Emphysema

filter_df <- small_df %>% select(EMPHDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$EMPHDX <- relevel(filter_df$EMPHDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")


fit <- glm(EMPHDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")

summary(fit)
## 
## Call:
## glm(formula = EMPHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial", 
##     data = filter_df)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -8.212236   0.286992 -28.615  < 2e-16 ***
## SDENICPROD1 YES              1.253981   0.193914   6.467    1e-10 ***
## SEX2 FEMALE                 -0.216396   0.114750  -1.886  0.05932 .  
## AGELAST                      0.069836   0.003905  17.883  < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE  0.696956   0.245796   2.836  0.00458 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3972.7  on 18178  degrees of freedom
## Residual deviance: 3497.9  on 18174  degrees of freedom
## AIC: 3507.9
## 
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>%  group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(EMPHDX == "1 YES"), tot = n()) %>%
  mutate(frequency = WithPress/tot) 
## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")


# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) + 
  geom_line() +
  geom_point(data = data_point, aes(y = frequency)) + 
  facet_wrap(~SEX) +
  labs(title = "Predicted Probability of EMPHDX by Age, Smoking Status, and Sex",
       x = "Age (AGELAST)",
       y = "Predicted Probability (Yval)") +
  theme_minimal()

To sum up all the results we can see above, even if the results are based on a different section of the population (because as we saw above cigarettes and electronic smoking devices are distributed in different ways), the results are pretty similar. With respect to cigarettes, the difference between smokers and non smokers are more accentuated in some cases, but most times are a little bit less present. That’s because the data in our possession about e-smokers are more distributed around younger ages, in which usually people tends to stay better regardless. Even with this specification, the variable is still very significant in most of our models, signalling that even if you smoke electronic cigarettes, not smoking at all tends to be increase the probabilities of staying better.

5.2 Multilogit regression for Mental Health

5.2.1 Perceived Mental Health Status

filter_df <-small_df %>% select(MNHLTH31, SDENICPROD, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$MNHLTH31 <- relevel(filter_df$MNHLTH31, ref = "1 EXCELLENT")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
filter_df$MNHLTH31 = ordered(filter_df$MNHLTH31)

df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))

fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)

matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
        ylab = "Predicted Probability", xlab = "Age", lwd = 2)
title("Mental Health by Age")  
grid()
legend("left",legend = sort(unique(filter_df$MNHLTH31)), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6)

null <-  vglm(MNHLTH31 ~ 1,
             family = cumulative(parallel = TRUE),
             data = filter_df)

pchisq(deviance(null)-deviance(fit), df <- df.residual(null)-df.residual(fit), lower.tail=FALSE)
## [1] 0.1164615
fit <- vglm(MNHLTH31 ~ SDENICPROD + SEX + AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)
summary(fit)
## 
## Call:
## vglm(formula = MNHLTH31 ~ SDENICPROD + SEX + AGELAST, family = cumulative(parallel = TRUE), 
##     data = filter_df)
## 
## Coefficients: 
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   -0.5917390  0.0467452 -12.659  < 2e-16 ***
## (Intercept):2    0.8014889  0.0469039  17.088  < 2e-16 ***
## (Intercept):3    2.5282903  0.0510390  49.536  < 2e-16 ***
## (Intercept):4    4.2967733  0.0692881  62.013  < 2e-16 ***
## SDENICPROD1 YES -0.5626211  0.0407733 -13.799  < 2e-16 ***
## SEX2 FEMALE     -0.2710512  0.0270943 -10.004  < 2e-16 ***
## AGELAST         -0.0033686  0.0007562  -4.455  8.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 49614.98 on 72677 degrees of freedom
## 
## Log-likelihood: -24807.49 on 72677 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
## 
## 
## Exponentiated coefficients:
## SDENICPROD1 YES     SEX2 FEMALE         AGELAST 
##       0.5697138       0.7625775       0.9966371
data <- filter_df
formula <- MNHLTH31  ~ SDENICPROD  + SEX + AGELAST

fit_model <- function(data, indices) {
  boot_data <- data[indices, ]
  model <- vglm(formula,
             family = cumulative(parallel = TRUE),
             data = boot_data)
  return(coef(model))
}

set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)

bootstrap_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = fit_model, R = 100)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* -0.591738980  1.433870e-03 0.0452970293
## t2*  0.801488911 -1.015003e-03 0.0474743409
## t3*  2.528290262 -8.625583e-04 0.0548296667
## t4*  4.296773306  1.272134e-03 0.0706058286
## t5* -0.562621062  5.643856e-03 0.0443905054
## t6* -0.271051163 -2.033241e-03 0.0240889656
## t7* -0.003368567  2.181939e-05 0.0007284386
boot.ci(bootstrap_results, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "basic")
## 
## Intervals : 
## Level      Basic         
## 95%   (-0.6856, -0.5046 )  
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable

5.2.1.1 Difference between smokers and non smokers

smk_df = filter_df %>% filter(SDENICPROD == "1 YES")
n_smk_df = filter_df %>% filter(SDENICPROD == "2 NO")
# For smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.5))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST") 
grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.15))
mtext("Perceived Mental Health Status for E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)

# For non smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = n_smk_df)

df_pred = data.frame(AGELAST = seq(min(n_smk_df$AGELAST), max(n_smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) # check that they sum to 1
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.14))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.37))
mtext("Perceived Mental Health Status for NOT E-Smokers", outer = TRUE, cex = 1.5, line = -3)

5.3 General Health E - Smoking

filter_df <-small_df %>% select(ADGENH42, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)

# filter_df <- filter_df %>% filter(AGEGROUP == "65+")

filter_df$ADGENH42 <- relevel(filter_df$ADGENH42, ref = "1 EXCELLENT")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")

fit <- vglm(ADGENH42 ~ AGELAST , data = filter_df, family = multinomial)

# summary(fit)

df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))


matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
        ylab = "Predicted Probability", xlab = "Age", lwd = 2)
  title("General Health by Age") 
  grid()
legend("bottomleft",legend = unique(filter_df$ADGENH42), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6, inset = c(0, 0.25))

confint(fit)
##                     2.5 %       97.5 %
## (Intercept):1  4.06916679  4.938852518
## (Intercept):2  3.95548228  4.808555597
## (Intercept):3  3.17255275  4.029734340
## (Intercept):4  1.49647559  2.398045613
## AGELAST:1     -0.05490893 -0.040851504
## AGELAST:2     -0.03600902 -0.022451069
## AGELAST:3     -0.02340982 -0.009822304
## AGELAST:4     -0.01105416  0.003210803
null = vglm(ADGENH42 ~ 1, family = multinomial, data  = filter_df)

print(pseudo_R2 <- 1 - deviance(fit) / deviance(null))
## [1] 0.02057306
anova(fit, test="LRT")
## Analysis of Deviance Table (Type II tests)
## 
## Model: 'multinomial', 'VGAMcategorical'
## 
## Link: 'multilogitlink'
## 
## Response: ADGENH42
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## AGELAST  4   791.25     55660      38461 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
filter_df$ADGENH42 = ordered(filter_df$ADGENH42)

fit <- vglm(ADGENH42 ~ SDENICPROD + SEX + AGELAST,
             family = cumulative(parallel = TRUE),
             data = filter_df)
summary(fit)
## 
## Call:
## vglm(formula = ADGENH42 ~ SDENICPROD + SEX + AGELAST, family = cumulative(parallel = TRUE), 
##     data = filter_df)
## 
## Coefficients: 
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   -0.1190151  0.0550320  -2.163   0.0306 *  
## (Intercept):2    1.6909065  0.0565412  29.906  < 2e-16 ***
## (Intercept):3    3.3801217  0.0618049  54.690  < 2e-16 ***
## (Intercept):4    5.4288257  0.0807611  67.221  < 2e-16 ***
## SDENICPROD1 YES -0.5795052  0.0477565 -12.135  < 2e-16 ***
## SEX2 FEMALE     -0.1732885  0.0311416  -5.565 2.63e-08 ***
## AGELAST         -0.0265767  0.0009036 -29.410  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 37518.88 on 55657 degrees of freedom
## 
## Log-likelihood: -18759.44 on 55657 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
## 
## 
## Exponentiated coefficients:
## SDENICPROD1 YES     SEX2 FEMALE         AGELAST 
##       0.5601755       0.8408950       0.9737734
data <- filter_df
formula <- ADGENH42 ~ SDENICPROD + SEX + AGELAST

fit_model <- function(data, indices) {
  boot_data <- data[indices, ]
  model <- vglm(formula,
             family = cumulative(parallel = TRUE),
             data = boot_data)
  return(coef(model))
}

set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)

bootstrap_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = fit_model, R = 100)
## 
## 
## Bootstrap Statistics :
##        original        bias     std. error
## t1* -0.11901505  0.0013523776 0.0524053679
## t2*  1.69090651 -0.0001473078 0.0571454857
## t3*  3.38012166 -0.0009748008 0.0661845763
## t4*  5.42882571  0.0039747501 0.0782417896
## t5* -0.57950520  0.0049793864 0.0495783138
## t6* -0.17328846  0.0004614767 0.0283499062
## t7* -0.02657666 -0.0000689214 0.0008788527
boot.ci(bootstrap_results, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "basic")
## 
## Intervals : 
## Level      Basic         
## 95%   (-0.2264, -0.0138 )  
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable

5.3.1 Difference btween smokers and non smokers

smk_df = filter_df %>% filter(SDENICPROD == "1 YES")
n_smk_df = filter_df %>% filter(SDENICPROD == "2 NO")
# For smokers
fit <- vglm(ADGENH42 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")
  grid()
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.25))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST") 
  grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.08))
mtext("General Health Status for E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)

# For not smokers
fit <- vglm(ADGENH42 ~ AGELAST,
             family = cumulative(parallel = TRUE),
             data = n_smk_df)

df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) 
cum_probabilities = t(apply(preds,1,plogis))

par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2, 
        ylab = "Cumulative probabilites", xlab = "AGELAST")+
  grid()
## integer(0)
legend("bottomleft", 
       legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.3))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
        ylab = "Category probabilites", xlab = "AGELAST") 
  grid()
legend("topright",
       legend = sprintf("Pr[Y = %g]", 1:5),
       col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
       inset = c(0, 0.08))
mtext("General Health Status for NOT E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)

Also in this case the results for Perceived Mental Health and General Health are quite similar to those previously discussed. A difference we can point out is a general improvement in perceived mental health for E-Smokers (compared to cigarette smokers) and a general worsening in general health for E-Smokers. One interpretation of this could be that some people are switching to E-smoking because of poor health, and what we are seeing is due to external factors or previous habits. Unfortunately, we do not have access to this type of data in this dataset, so we will limit our discussion to what we are observing.

6 Conclusions

Our study investigates the negative health effects associated with smoking, focusing on both physical and mental health outcomes. Using the 2021 Medical Expenditure Panel Survey Household Component data, which includes responses from 15,000 households in the United States, we compared the health impacts of smoking with those of using electronic nicotine products (e-nic products). Our analysis reveals that overall general and mental health is better among non-smokers compared to both smokers and e-nic product users.

However, the data set’s limited size constrains the robustness of our findings. Larger and more comprehensive data would likely provide more definitive insights into these health relationships. Additionally, our understanding of the health effects associated with e-nic products remains incomplete due to the relatively new nature of these products and the insufficient data available in our current dataset. Future research should aim to gather more extensive data on e-nic product usage to better assess its impact on mental and general health.

Despite these limitations, our findings contribute to the growing body of evidence highlighting the adverse health effects of smoking and underscore the need for continued public health efforts to reduce smoking rates and further investigate the health implications of e-nic products.